PREFACE TO THE REVISED EDITION


During the fourteen years that have elapsed since the first edition
of this book was published there has been a very considerable
extension of the use of statistical methods in business, in public 
administration, and in all the social sciences. The pressing require- 
ments of new tasks and new problems, together with increasing 
knowledge of statistical procedures on the part of administrative 
and research workers, have contributed to this extension. With 
this development, the older controversies over qualitative versus 
quantitative methods have largely been shelved. It is clear that 
different problems call for different procedures; that the men who 
are grappling with research problems differ, as regards the methods 
of analysis they find congenial and fruitful; that induction and
deduction are complementary phases of the processes that lead to 
scientific advance. The choice of research procedures does not
necessitate the acceptance of one method and the rejection of an- 
other; it calls for the finding of a blend of methods that is adapted 
to a particular set of problems, and that is suited to the tempera- 
ment and abilities of the human agent that employs them. For 
workers dealing with social and economic relations, statistical 
methods constitute an essential element of this blend. Knowledge 
of systematic procedures for handling quantitative data, and skill 
in their use, are necessary parts of the equipment of students of the
social sciences and of public and private administrators who must 
utilize the facts of experience in the formulation of policies. 

Gains on this front have been paralleled by notable improvements
in statistical techniques. The post-war years have witnessed,
in this field, the initiation of such another period of intellectual
ferment and creative activity as that which, earlier, brought the
great contributions of Karl Pearson and his associates. The older
instruments of quantitative analysis have been refined and sharpened;
methods of designing statistical experiments and formulating
and testing hypotheses have been improved; statistical inference
has been placed on a sounder foundation. There can be no
doubt that these continuing improvements in the logic and in the
technique of statistics will contribute in important ways to the
advance of the social sciences and to the betterment of public and
private administration.





In preparing the present edition of Statistical Methods account
has been taken of the more important of the recent developments
that have a bearing on the economic and business applications of
statistics. In doing this I have sought to retain the main features
of the first edition. A systematic development of the fundamentals
of statistical method is needed by the beginning student. A working
compendium of procedures, with necessary aids to calculation
and reference tables, is required by the statistician engaged in administration
or research. The book is designed to meet these two
needs.

The eighteen chapters of the present edition fall into two main 
divisions. The first twelve chapters deal with the descriptive as- 
pects of statistics. Induction and sampling are purposely omitted 
in this development of basic descriptive procedures. Problems of 
statistical inference, with certain more advanced aspects of statistical
description, are discussed in the last six chapters, and in
appendices A to E. This organization is, I think, well adapted to 
the needs of instruction. Some teachers may, indeed, prefer to 
introduce at an earlier point the concepts of samples and parent 
populations and the treatment of sampling errors. If so, selected 
pages from the chapter on elementary probabilities and the normal 
curve (Chapter XIII) and from the introductory chapter on induc- 
tion (Chapter XIV) may follow Chapter V in the sequence of 
study. 

In the chapters added to this edition I have sought to exemplify 
economic applications of the newer methods of analysis. These
methods offer rich and, as yet, largely unexplored possibilities to
research workers in the social sciences. In these sections I have
drawn heavily on the path-breaking work of R. A. Fisher. I am
indebted to Dr. Fisher and his publishers, Oliver and Boyd of
Edinburgh, for permission to include in this book the tabulations
that appear in certain of the Appendix Tables. These, with the 
other tables included, are designed to make the present book a 
reasonably complete working manual adapted to the needs of both 
laboratory worker and student. 

I must reaffirm my thanks to those who assisted me in various 
ways in the preparation of the first edition. I am indebted, in
addition, to Jacob M. Gould, Agnes B. Omundson, and William H.
Mills for valuable aid in the details of the revision. 


May, 1938. 


F. C. M.



PREFACE TO THE FIRST EDITION 

The last decade has witnessed a remarkable stimulation of 
interest in quantitative methods in business and in the social 
sciences. The day when intuition was the chief basis of business 
judgment and unsupported hypothesis the mode in social studies 
seems to have passed. Following the lead of workers in the older 
and traditionally more accurate physical sciences, social scientists 
and serious students of business are employing in greater measure 
than ever before a method of study based upon the observation 
and analysis of facts. When these observations are quantitative
in character appropriate methods are necessary for their organiza- 
tion and interpretation. This book deals with methods of com- 
bining and analyzing such observations, with primary emphasis 
upon materials drawn from the fields of economics and business. 

The justification for limiting the treatment to these particular 
fields is two-fold. Although general statistical methods are practically
universal in their application, special problems are encountered
in every field of study. This is particularly true in the
realm of economics, which presents many distinctive difficulties 
and many characteristic problems. Methods that are in some 
degree specialized to meet these particular requirements have 
been developed, and these methods call for treatment in a work 
that is restricted in scope. In the second place, methods can 
be most effectively explained in terms of particular subjects; abstract
methodology is barren of interest to the average person.
For these reasons the book has been written with reference to the
specific needs of quantitative workers in economics and business.

In the explanation of methods no attempt has been made to
secure the brevity of exposition which may be desirable in a
strictly mathematical work. The purpose throughout has been
to write for the learner, not for the finished master, and the explanations
have been prepared with the needs of the former in mind.
I have felt free to omit certain detailed demonstrations of theorems
because this book is presented as an introduction to the subject,
not as an exhaustive treatise.

The methods of quantitative analysis that are in general use
today represent a long accretion, an accumulation of contributions
from workers in many fields. It would be vain to attempt to
enumerate all the individuals who have contributed to the development
of the science of statistics. Individual references are made
in particular cases in the body of the text, but no list of such acknowledgments
can serve as a complete record of the debt modern
statisticians owe to their predecessors.

For assistance in the preparation of the material contained in
this book I am under many obligations. To Mr. H. E. Anderson
and Professor H. B. Killough I am indebted for certain of the data
employed in Chapters XI, XVI, and XVII. Professor Warren M.
Persons of the Harvard Committee on Economic Research has
courteously permitted me to make use of certain results of his work
on commodity price index numbers. The index of industrial activity
presented in Chapter IX and utilized in Chapter XI is a product
of the Statistical Division of the American Telephone and Telegraph
Company. I have employed it with the kind permission of
Mr. Seymour L. Andrew, Chief Statistician. Suggestions from
Professor A. H. Mowbray of the University of California have enabled
me to remove several obscurities that were present in an
earlier mimeographed edition. I am deeply grateful to Professors
Henry L. Moore, Theodore H. Brown, and Henry Schultz for their
help in critically reviewing portions of the manuscript. For assistance
at every stage of the work involved in the writing of this book
I am under deep obligation to Professor Donald H. Davenport. 
His aid in the collection of material, in the preparation of charts, 
and in the onerous task of seeing the book through the press has 
been invaluable. To my wife, above all others, I am indebted 
for a measure of constant and generous help that cannot be ade- 
quately acknowledged here. 


November, 1924.


F. C. M.
CHAPTER I


STATISTICAL METHODS AND THE PROBLEMS 
OF ECONOMICS AND BUSINESS 

The distinction between economics and business rests upon 
viewpoint and approach, rather than subject matter. The 
economist and the business man have different objectives, 
but the substance of the science of economics and the mate- 
rials with which the art of business administration deals 
are in large part the same. In this treatise we are con- 
cerned with methods that may be employed in handling this 
common subject matter. 

Classes of Business Activity

The tasks that confront business men may, without undue 
straining, be placed in three classes. First, in logical se- 
quence, are the technical tasks that arise in the processes 
of production, involving problems of chemistry and physics, 
of engineering, of animal husbandry, of navigation. The 
basic technical knowledge called for in the solution of these 
problems furnishes the foundation of our economic life.
This is the domain of the hard-won arts of handling the 
raw materials and controlling the forces of nature. 

In the second class come activities that are connected 
with the internal organization and administration of indi- 
vidual business units. The technical functions of manipulat- 
ing organic and inorganic matter for the satisfaction of 
human wants are performed through administrative units, 
single farms, mines, factories, railroads, department stores. 
A whole new division of problems is faced by the business 
man in organizing these units, in coordinating the work of 
the different departments, in supervising the daily activities
of the individuals making up each organization. While these
are perhaps less fundamental than the technical problems of 
production, they are, for the average business man, more 
pressing and more difficult. Scientific method has made 
less progress in solving these latter problems. There is not 
the organized body of knowledge which is found in the 
former field, nor are there the same trained experts to whom 
the tasks may be delegated. 

The two types of economic activity named above include 
tasks that are in a sense self-centered and controllable. The 
manufacturer of steel has his technical problems of smelting
and refining, his particular administrative duties. The
farmer or mine-owner faces the same types of problems, in 
forms peculiar to his own situation. In the performance of 
tasks in these fields each man is dealing with problems all 
the elements of which are under more or less perfect control. 
Difficulties arise, but these are ordinarily difficulties inherent 
in the given task, not difficulties arising from a sudden change 
in the constituent elements of the problem, or the sudden 
interjection of a new factor. In this respect the third cate- 
gory of tasks to be performed by the business man differs 
materially from the first two. For this class is composed of 
problems the elements of which are subject only in part 
to control by the individuals directly concerned. 

This third division includes buying and selling, and all 
the attendant activities that are carried on in terms of 
prices. As economic life is at present organized these func- 
tions are, to the business man, the most important ones he 
performs. The technical tasks of production and of internal 
organization and administration are but means to an end. 
For the business man the goal of economic activity is the
disposal of his product at a profit. The tasks preliminary to 
this final sale are of necessity subordinated to it, and so 
performed that the final aim may be achieved. The point 
of emphasis here is that the business man, in buying and 
selling, faces problems containing elements which he cannot 
control. In securing his raw material, in bringing together
the other agents needed in production, and in the final disposal
of his product, the business man deals with markets
(commodity markets, labor markets, money markets) and
finds himself acting in relation to a system of prices quite 
beyond his control in its major movements. The other 
less fundamental phases of his activity are subject to a high 
degree of control, but when the business man comes to the 
final and most important act, the profitable sale of his prod- 
uct, his power of control dwindles. The motivating force in 
business activity is the hope of pecuniary profits, pecuniary 
profits depend upon successful buying and selling, successful
buying and selling depend upon favorable conditions in an 
uncontrollable world of prices — here is the argument that 
states the major problem of business. And these are the 
facts which make the price system the dominating and 
all-important factor in modern business life. 

The modern entrepreneur lives in an environment of 
prices. The term “environment” is not an unapt figure; 
this world of prices in which the business man functions 
constitutes a coherent, consistent, well-articulated system 
of interdependent parts, a system which encompasses all 
the business activities of the entrepreneur. Since the system 
is beyond the control of the individual he must adapt himself
to it, and must base his activities upon as complete an
understanding of the system as he may obtain. Without 
this understanding the major problems of business are in- 
capable of solution. 

Quantitative Character of Economic and Business Problems

Problems falling in the first of the classes outlined above
have long been recognized as essentially quantitative in 
character. Their solution calls for the application of the 
methods of precision which have been developed in the 
physical sciences. It is no less true that the strictly economic
and business problems falling in the other classes
require the employment of quantitative methods. Qualitative
considerations enter, of course, in the solution of such
problems, helping to determine the questions to be asked
and the methods to be employed. But facts, measured,
weighed and compared with other facts, constitute the basis
of business judgments and the foundation of economic rea- 
soning. Statistical methods provide means of organizing 
and appraising these facts. 

Of the three classes of problems distinguished in the pre- 
ceding section two come within the scope of the present dis- 
cussion. Though the methods of statistics are in part ap- 
plicable to the solution of technical problems of production, 
it is not the purpose of the present work to develop this 
subject. For the solution of problems in the two other
fields (those connected with the internal organization and
administration of business units and with the processes of
buying and selling that bring the business man into contact
with the price system) methods of statistical analysis are
peculiarly appropriate.

Statistical Methods and Problems of Internal
Administration 

The typical business man, in the administration of his 
organization, is called upon to deal with masses of measure- 
ments. He is dealing with tons of coal, cubic feet of gas, 
or kilowatt hours of energy consumed; with tons of pig iron 
or pairs of shoes produced; with machine hours and man 
hours; with wages, costs of production and selling prices 
expressed in dollars and cents. With the increasing size of 
the business unit the data with which the administrator 
must deal become increasingly complicated and numerous, 
and it becomes increasingly difficult to determine their true 
significance. Under intuitive or rule-of-thumb methods of 
administration it is impossible effectively to analyze large 
masses of data and to control business units above the 
average in size. It has been abundantly demonstrated that
the law of decreasing returns comes into play in business 
largely because of administrative difficulties. 

Whenever one deals with masses of data the problem is 
one of condensation and analysis — condensation and sim- 
plification in order that it may be possible for limited human 
faculties to handle the data, analysis (and comparison) in 
order that the elements of the problem may be distinguished 
and their significance appreciated. Statistical methods have 
been developed to facilitate the condensation and analysis 
of masses of quantitative data. 

As a typical example of such a problem may be mentioned 
the allocation of costs, an operation which has been called 
cost accounting. The proper analysis of all the factors 
which enter into this problem is only possible through the 
use of statistical methods. Accounting methods, restricted 
to the treatment of pecuniary units, are inadequate for the 
complete analysis of the items of expense. The analysis of 
sales records, again, calls for the condensation of masses of 
data, their representation in simple, understandable form, 
and their interpretation in relation to other business meas- 
urements. The analysis of markets and the study of purchas- 
ing records and commodities require the use of quantitative 
methods not restricted in their application to any one class 
of measurements. At every hand in internal administration 
statistical methods may be used to supplement accounting 
methods, to extend the knowledge of the executive, and to 
make more effective the control of business operations. 

Statistical Methods and External Problems

New problems are encountered when the business man 
goes into the market to buy or sell. Continually before him 
are the phenomena of business cycles, and if he is to adapt 
his producing and marketing policies to the swings of the 
cycle he must undertake the analysis of these phenomena, 
employing tools appropriate to the task. Again, the price 
system, the movements of which are of such fundamental
interest to the business man, requires analysis through the
use of quantitative methods. So complex and numerous are 
the data to be dealt with here that simplification is impera- 
tive. Apart somewhat from the immediate interests of the 
business man, but of dominant importance to the economist, 
are all the problems connected with the economic process of 
distribution, the allocation of income and wealth among 
the agents of production. These, as well as that other great 
economic problem concerned with the question of value or 
price determination, are quantitative problems, to be solved 
through the use of quantitative methods of research. 

Statistical Procedures in Research 

What are these methods, and wherein does research em- 
ploying such methods differ from other types of research? 
Scientific inquiry, whatever its particular method may be, 
proceeds through careful observation, logical inference and 
accurate verification. Quantitative methods differ from 
others only in that observation, inference, and verification 
are based upon measurement. Until measurement is possible 
in a science it is unavoidable that its observations and find- 
ings should lack precision, no matter how brilliant the flashes 
of intuition nor how painstaking the labors of its students 
may be. The employment of methods of measurement, 
making possible the analysis of the factors involved in terms 
of precise units, gives to a science some of the advantages 
that sharp-edged tools have over blunt and unreliable instruments.
Mathematics and its offspring, statistics and accounting,
are the powerful instruments which the modern
economist has at his disposal, and of which business, through 
the development of research agencies and methods, is making 
constantly greater use. 

The tools of the statistician are merely certain mathe- 
matical methods, developed for particular types of research. 
These types of research were not economic in the original 
development of statistical methods, but social, political, and
anthropometric, with one line of development (that relating
to the theory of probabilities) extending back through the 
field of logic to the gaming table. Yet these tools, developed 
for work in restricted spheres, have been found to possess 
much wider applicability, and economics has been one of 
the newer fields in which the application of these methods, 
with appropriate alterations and additions, has had fruitful 
results. The economist has found his hand strengthened 
and the precision of his work materially increased by the 
new tools. And business, together with the more abstruse 
science of economics, has profited. 

Reference has been made to the possibility of condensa- 
tion and simplification through the use of statistical pro- 
cedures. Such simplification is of cardinal importance in 
economics and in the other social sciences today. These 
sciences, to be realistic, must be scrupulously faithful to 
fact, yet the masses of facts relating to current social proc- 
esses are, in their magnitude, almost a menace to effective 
analysis. “Already,” writes a reviewer in the Journal of the 
Royal Statistical Society, “economic analysis taxes language 
to its utmost, and it is a question how much longer mere 
verbal exposition will be able to control the swelling floods
of observable data.” Though one may feel that these floods 
of data fail to provide many of the essential facts about 
social processes today, there is point to the reviewer’s com- 
plaint. In the light of this danger systematic procedures 
in the organization and analysis of data have an importance
today that they did not have at an earlier time. Statistical
methods constitute such procedures. By their use we may 
seek to channel and appraise the floods of data, relating to 
business operations and other social processes, that the 
fact-gathering agencies of business and government currently 
release upon us. 
GRAPHIC PRESENTATION

The explanation of methods of condensing, analyzing, and 
interpreting the facts of business and economics must start
with the discussion of some fundamental considerations
which are mathematical rather than statistical in character.
In doing so it is deemed advisable, even at the risk of tread- 
ing quite familiar ground, to explain certain mathematical 
conceptions to which constant reference will be made in 
later chapters. 

Statistical analysis is concerned primarily with data based 
upon measurement, expressed either in pecuniary or physical 
units. The methods of coordinate geometry, developed first 
by the philosopher Descartes, greatly facilitate the manipu- 
lation and interpretation of such data. A summary of the 
basic principles of coordinate geometry will not be out of 
place. 

Rectangular Coordinates 

If two straight lines intersecting each other at right
angles are drawn in a plane, it is possible to describe the
location of any point in that plane with reference to the
point of intersection of the two lines. We will call one of
the lines (a vertical line) Y'Y, the other line (horizontal) X'X,
and the point of intersection (or origin) O (cf. Fig. 1). If P
be any point in the plane, we may draw the line PM, parallel
to Y'Y and intersecting X'X at M, and the line PN,
parallel to X'X and intersecting Y'Y at N. If we set OM
equal to g units and ON equal to h units, g and h constitute
the coordinates of P, describing its location with reference
to the origin O. Thus, in Fig. 1, g equals 6 and h equals 5.

The distance g along the x-axis is termed the abscissa of
the point P, while the distance h along the y-axis is termed
the ordinate of the point P. (It is a rule of notation always
to give the abscissa first, followed by the ordinate.) The
coordinates of any other point in the same plane may be
determined in the same way. Conversely, any two real
numbers determine a point in the plane, if one be taken as
the abscissa and the other as the ordinate.

Fig. 1. Location of a Point with Reference to Rectangular Coordinates

A point may lie either to the right or left or above or 
below the origin, O. It is conventional to designate as 
positive abscissas laid off to the right of the origin, and as 
negative abscissas laid off to the left of the origin, while 
ordinates are positive when laid off above the origin and
negative when laid off below the origin. In general, the
values to be dealt with in economic statistics lie in the
upper right-hand quadrant, where both abscissa and ordinate 
are positive. 
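
The sign conventions just described can be put in the form of a short program. What follows is only an illustrative sketch in Python; the names Point and quadrant are introduced here for the purpose and form no part of the original discussion.

```python
from typing import NamedTuple


class Point(NamedTuple):
    """A point located by rectangular coordinates (hypothetical helper)."""
    abscissa: float  # distance along the x-axis, always given first
    ordinate: float  # distance along the y-axis, always given second


def quadrant(p: Point) -> str:
    """Describe where p lies relative to the origin, using the text's sign rules."""
    if p.abscissa == 0 or p.ordinate == 0:
        return "on an axis"
    # Positive abscissas lie to the right of the origin, negative to the left.
    horizontal = "right" if p.abscissa > 0 else "left"
    # Positive ordinates lie above the origin, negative below.
    vertical = "upper" if p.ordinate > 0 else "lower"
    return f"{vertical} {horizontal}"


# The point P of Fig. 1, with abscissa 6 and ordinate 5.
print(quadrant(Point(6, 5)))
```

Most economic series, being made up of positive quantities, would fall in the region this sketch labels "upper right".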

This conception of coordinates is fundamental in mathematics
and of basic importance in statistical work. A very
simple example will illustrate the utility of this device in
representing business data. The figures presented in the 
following table may be employed. 

Table 1

Production of Passenger Automobiles in the United States, by
Months, During the Year 1937

Month        Number of passenger cars manufactured
January      309,037
February     296,636
March        403,879
April        439,980
May          425,432
June         411,394
July         360,403
August       311,456
September    118,671
October      298,662
November     295,328
December     244,385


These data may be represented graphically on the coordinate
system, months being laid off along the x-axis and
number of automobiles along the y-axis, as in the accompanying
diagram (Fig. 2). In plotting the abscissas, December,
1936, is considered as located at the point of origin.
The x-value of the entry for January, 1937, is thus 1, of
the February figure 2, etc. The coordinates of the point
representing the number of cars produced in January, 1937,
are 1 and 309,637; for February the values are 2 and 296,636.
The coordinates for December are 12 and 244,385. The
movement of automobile production during the year may
be more easily followed if the points are connected by a
series of straight lines, as is done in the figure.

Fig. 2. Production of Passenger Automobiles, by Months, During the
Year 1937 (vertical axis: thousands of cars)
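
The construction of the points plotted in Fig. 2 may be sketched in a few lines of Python. The monthly figures are those of Table 1 as printed; the variable names are illustrative only and are not part of the original discussion.

```python
# Monthly production of passenger cars, 1937, transcribed from Table 1 as printed.
production = [
    ("January", 309_037), ("February", 296_636), ("March", 403_879),
    ("April", 439_980), ("May", 425_432), ("June", 411_394),
    ("July", 360_403), ("August", 311_456), ("September", 118_671),
    ("October", 298_662), ("November", 295_328), ("December", 244_385),
]

# December, 1936, is taken as the origin, so January, 1937, has abscissa 1.
points = [(x, cars) for x, (_month, cars) in enumerate(production, start=1)]

# The lowest and highest ordinates mark the range of the dependent variable.
lowest = min(points, key=lambda p: p[1])
highest = max(points, key=lambda p: p[1])

for x, y in points:
    print(x, y)
```

Joining successive entries of `points` by straight lines would give the broken line of the figure.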

Independent and Dependent Variables

In the location of any point by means of coordinates it 
has been pointed out that two values are involved; every 
point ties together and expresses a relation between two 
factors. In the above case these are months and number of 
passenger automobiles produced. With the passage of time 
the volume of automobile production changes, and the 
broken line shows the direction and magnitude of these 
changes. Both time and number of cars produced are vari- 
ables, that is, they are quantities not of constant value but 
characterized by variations in value in the given discussion.
Thus in Fig. 1 the abscissa has a fixed value of 6, while the 
ordinate has a fixed value of 5, but in Fig. 2 both abscissa 
and ordinate have varying values, the one varying from 1 
to 12, the other from 118,671 to 439,980. The symbols x 
and y are, by convention, used to designate such variable
quantities as these, the former in all cases representing the
variable plotted along the horizontal axis, the latter representing
the variable plotted along the vertical axis.*

In Fig. 2, which depicts the changes taking place in 
automobile production with the passage of time, it will be 
noted that the latter variable changes by an arbitrary unit, 
one month. Having made an independent change in the 
time factor we then determine the change in price taking 
place during the period thus arbitrarily chopped out. The 
variable which increases or decreases by increments arbitrarily
determined is called the independent variable, and is
generally plotted on the x-axis. The other variable is termed 
the dependent variable, and is plotted on the y-axis. This 
dependence may be real, in the sense that the values of the 
second variable are definitely determined by the values of 
the independent variable, or it may be purely a conven- 
tional dependence of the type described. Time, it should 
be noted, is always plotted as independent, when it consti- 
tutes one of the variables. 

* It should be noted that letters at the end of the alphabet are used as
symbols for variables, while letters at the beginning of the alphabet are used as
symbols for constants, i.e., quantities the values of which do not change in the
given discussion.

Functional Relationship

When the relationship between two variables is one of
complete dependence, so that the value of y is uniquely
determined by a given value of x, y is said to be a function
of x. The general expression for such a relationship is
y = f(x). Thus the speed at which a body is falling at a
given moment is a function of the time it has been falling,
the pressure of a given volume of gas is a function of its
temperature, the increase of a given principal sum of money
at a fixed rate of interest is a function of time. If the values
of the independent variable be laid off on the x-axis of a
rectilinear chart and the corresponding values of the function
(i.e., the dependent variable) be laid off on the y-axis,
a graphic representation of the function will be secured, in
the form of a curve.* This concept of functional relationship
is a very important one in statistical work. Some of the
simpler functions may be briefly discussed.
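
The third of the examples cited, the growth of a sum of money at interest, lends itself to a brief sketch in Python. Annual compounding, the principal of 100, and the rate of 5 per cent are assumptions made here purely for illustration; the text specifies only a fixed rate of interest.

```python
def amount(principal: float, rate: float, years: int) -> float:
    """Value of a principal sum after a number of years.

    Assumes interest compounded once a year at a fixed rate; both the
    compounding rule and the function name are illustrative choices.
    """
    return principal * (1 + rate) ** years


# Each value of the independent variable (time) determines exactly one
# value of the dependent variable (the accumulated amount).
curve = [(t, round(amount(100.0, 0.05, t), 2)) for t in range(6)]

for t, y in curve:
    print(t, y)
```

Laying off the first member of each pair on the x-axis and the second on the y-axis would give the graphic representation of this function.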

THE STRAIGHT LINE

If two variables are so related that their values are always 
the same, their relationship is obviously of the form y = x. 
As a very simple example, the relation between the age of a 
tree and the number of rings in its trunk may be cited. 
A tree 6 years old will have 6 rings, for 20 years there will
be 20 rings, and so on. This relationship may be represented 
on a coordinate chart, several sample values of x and y 
being taken. When these points are plotted and a line drawn 
through them, we secure a straight line passing through the
origin and (assuming the two scales to be equal) bisecting
the right angle XOY (cf. Fig. 3). 

* The general term curve is used to designate any line, straight or curved,
when located with reference to a coordinate system.

Similarly, any equation of the first degree (i.e., not involving
xy, or powers of x or y above the first) may be represented
by a straight line. The generalized equation can
be reduced to the form y = a + bx, where a is a constant
representing the distance from the origin to the point of
intersection of the given line and the y-axis, and b is a constant
representing the slope of the given line (that is, the
tangent of the angle which the line makes with the horizontal).
The constant term a is called the y-intercept. It is
clear from the generalized equation of the straight line that
when x has a value of zero, y will be equal to this constant
term. In the example given above (Fig. 3) a is equal to 0,
and b to 1. The location of a given line depends upon the
signs of a and b as well as upon their magnitudes. The practical
problem involved in the determination of any straight
line is that of finding the values of a and b from the data,
a problem which will appear in various forms in the discussion
of statistical methods.

These points may be illustrated by the plotting of a
simple equation of the first degree. Thus, to construct
the graph of the function y = 2 + 3x, various values of
x are assumed, and corresponding values of y are
determined. These may be arranged in the form of a table:

      x          y
              (2 + 3x)
     -4        -10
     -2         -4
      0          2
      2          8
      4         14
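The tabulation can also be produced mechanically. The following Python sketch is an illustration only and is not part of the original text; the function name `linear` and the particular x values (spaced 2 apart) are assumptions chosen for the example. It evaluates y = 2 + 3x and lists the increments of y between successive points.

```python
# Tabulate the linear function y = a + b*x for a few assumed x values.
def linear(x, a=2, b=3):
    """Return y = a + b*x (the defaults give the example y = 2 + 3x)."""
    return a + b * x

xs = [-4, -2, 0, 2, 4]            # illustrative x values, spaced 2 apart
ys = [linear(x) for x in xs]

for x, y in zip(xs, ys):
    print(f"{x:>3} {y:>4}")

# Differences between successive y values; constant for a straight line.
increments = [later - earlier for earlier, later in zip(ys, ys[1:])]
print(increments)
```

Since the function is linear, each step of 2 in x produces the same increment in y, which is the property that permits the line to be located from any two of the points.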


Plotting these values and connecting the plotted points,
the graph illustrated in Fig. 4 is secured. It will be noted
that since this function is linear (that is, the graph takes
the form of a straight line) any two of the points would have
been sufficient to locate the line. The y-intercept is equal
to the constant term 2, and the tangent of the angle which
the given line makes with the horizontal (the slope of the line)
is equal to 3, the coefficient of x. That this curve
represents the equation is proved by the fact that the equation is
satisfied by the coordinates of every point on the curve, and
that every pair of values satisfying the equation is represented
by a point on the curve. It is characteristic of a linear
relationship that if one variable be increased by a constant
amount, the corresponding increment of the other variable
will be constant. In the above case as x grows by constant
increments of 2, for example, the constant increment of the
y-variable is 6. Series which increase in this way by
constant increments are termed arithmetic series.

Fig. 4. Graph of the Equation y = 2 + 3x

Many examples of linear relationship between variables
are found in the physical sciences. An example from the
economic world is found in the growth of money at simple
interest, that is, interest which is not compounded. If we
let r represent the rate of simple interest, x the number of
years, and y the sum to which one dollar will amount at
the end of x years, the equation of relationship is of the
form

y = 1 + rx.

Since in a given case r will be constant, this is of the simple
linear type. In statistical work precise relationships of
this type rarely if ever occur, but approximations to the
straight line relationship are found constantly.
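The simple interest relation may be checked numerically. The sketch below is offered only as an illustration; the function name `simple_interest_amount` and the rate of 6 per cent are assumptions made for the example and do not come from the text.

```python
def simple_interest_amount(years, rate):
    """Amount to which one dollar grows at simple interest: y = 1 + r*x."""
    return 1 + rate * years

rate = 0.06   # assumed rate, for illustration only
amounts = [simple_interest_amount(x, rate) for x in range(6)]

# Year-to-year additions; for simple interest each equals the rate.
yearly_additions = [b - a for a, b in zip(amounts, amounts[1:])]

for x, y in enumerate(amounts):
    print(x, round(y, 2))
```

The yearly additions are equal (apart from floating-point rounding), so the amounts form an arithmetic series and plot as a straight line with y-intercept 1 and slope r.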

NON-LINEAR RELATIONSHIP

Non-linear functions are of many types, of which only 
a few of the more common will be discussed here. The 
student should be familiar with the general characteristics 
of the chief non-periodic curves, of which the parabolic 
and hyperbolic types, on the one hand, and the exponential 
type on the other, are the most important. The potential
series is mentioned as a more general form of rather wide
utility. Of periodic functions the sine curve is briefly
described, as a fundamental form.

Functional relationships of the parabolic or hyperbolic 
form are quite common in the physical sciences, and such 
curves are found to fit certain classes of economic data. 
The general equation, when there is no constant term, is 
of the form y = ax^b. The curve is parabolic when the
exponent b is positive, and hyperbolic when b is negative.
The two following examples will serve to illustrate these
types:

Problem: To construct the graph of the function y = x^2.

      x        y
             (x^2)
     -5       25
     -4       16
     -3        9
     -2        4
     -1        1
      0        0
      1        1
      2        4
      3        9
      4       16
      5       25

The graph is shown in Fig. 5. 

Problem: To construct the graph of the function y = x^-1,
for positive values of x.

      x        y
            (x^-1)
     1/3       3
     1/2       2
      1        1
      2       1/2
      3       1/3
      4       1/4
      5       1/5

The graph of this function, an equilateral hyperbola, is
shown in Fig. 6. It should be noted that this equation
may also be written y = 1/x or xy = 1.

It is characteristic of relationships of this type that as x
increases in geometric progression, y also increases in geometric
progression. Thus, in the example of the parabola given above
(y = x^2), if we select the x values which form a geometric
series,¹ the corresponding y values form a similar series:

X 1 2 4 8 16 32 

y 1 4 16 64 256 1,024 
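The two rows above can be verified by computing the ratio of each term to the one before it. This Python sketch is an illustration rather than part of the original text; the function name `power_curve` is an assumption, and the hyperbola is added as a second case.

```python
def power_curve(x, a=1.0, b=2.0):
    """Return y = a * x**b (parabolic for b > 0, hyperbolic for b < 0)."""
    return a * x ** b

def ratios(values):
    """Ratio of each term to the preceding term."""
    return [later / earlier for earlier, later in zip(values, values[1:])]

xs = [1, 2, 4, 8, 16, 32]                          # geometric series, multiplier 2
parabola = [power_curve(x) for x in xs]            # y = x^2
hyperbola = [power_curve(x, b=-1.0) for x in xs]   # y = x^-1

print(ratios(xs))          # constant multiplier of the x series
print(ratios(parabola))    # constant multiplier of the y series
print(ratios(hyperbola))   # constant multiplier less than 1
```

For the parabola the y series has the constant multiplier 4; for the hyperbola the multiplier is 1/2, so that y forms a geometric series which declines as x grows.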

¹ A geometric series is one each term of which is derived from the preceding
term by the application of a constant multiplier.

Fig. 6. Equilateral Hyperbola: Graph of the Equation y = x^-1
(for positive values of x)

Another class of functions is of the form represented by
the equation y = ab^x. In equations of this type one of the
variable quantities occurs as an exponent; graphs
representing such equations are called exponential curves. The
example which follows illustrates the type.

Problem: To construct the graph of the function y = 2^x,
for positive values of x.


      x        y
             (2^x)
      0        1
      1        2
      2        4
      3        8
      4       16
      5       32
      6       64

This graph is shown in Fig. 7.

Fig. 7. Exponential Curve: Graph of the Equation y = 2^x
(for positive values of x)

It has been noted that the relationship between two
variables which increase by constant increments
(constituting arithmetic series) may be represented by a straight
line, and that the relationship between variables increasing
in geometric progression may be represented by either a
parabola or a hyperbola. The exponential curve constitutes
a hybrid type. It describes a relation in which one variable
increases in arithmetic progression while the other increases
in geometric progression. The figures given above illustrate
this relationship.

Curves based upon relationships of the following type have
been employed extensively in statistical inquiries:

y = a + bx + cx^2 + dx^3 + ....

The term potential series has been applied to equations of
this type. Though such curves do not constitute parabolas
of the strict conic section type, a curve based upon such
an equation carried to the second power of x is termed a
second degree parabola, to the third power of x, a third
degree parabola, etc. No uniform and simple type is secured
from this series. It is treated in more detail at a later point.
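Returning for a moment to the exponential curve, its hybrid character may be confirmed by direct computation. The sketch below is illustrative only; the function name `exponential` and its default constants (a = 1, b = 2, giving y = 2^x) are assumptions made to match the worked problem.

```python
def exponential(x, a=1, b=2):
    """Return y = a * b**x; the defaults give the example y = 2^x."""
    return a * b ** x

xs = list(range(7))                    # 0, 1, ..., 6: an arithmetic series
ys = [exponential(x) for x in xs]

x_increments = [q - p for p, q in zip(xs, xs[1:])]   # constant differences
y_multipliers = [q / p for p, q in zip(ys, ys[1:])]  # constant ratios

print(ys)
print(x_increments)
print(y_multipliers)
```

The x values advance by a constant difference while the y values advance by a constant multiplier, which is the combination of arithmetic and geometric progression described in the text.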

Periodic functions constitute another distinct type, a class
represented notably by electrical and meteorological
relations, though not confined to these fields. The
characteristic feature of such relationships is that values of the
dependent variable repeat themselves at constant intervals
of the independent variable. The sine curve, the basic type
of this class, is illustrated in the following example.

Problem: To construct the graph of the function y = sin x.

          x                y
  (angle in degrees)    (sin x)
          0°              .000
         30°              .500
         60°              .866
         90°             1.000
        120°              .866
        150°              .500
        180°              .000
        210°            - .500
        240°            - .866
        270°            -1.000
        300°            - .866
        330°            - .500
        360°              .000
        390°              .500
        etc.

The graph is shown in Fig. 8. 
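The values of the sine table can be recomputed with the Python standard library. This sketch is an illustration and not part of the original text; the helper name `sine_of_degrees` is an assumption. Note that `math.sin` expects radians, so the angle is converted first.

```python
import math

def sine_of_degrees(angle_deg):
    """Sine of an angle given in degrees, rounded to three decimals."""
    value = round(math.sin(math.radians(angle_deg)), 3)
    return value + 0.0   # adding 0.0 turns a rounded -0.0 into 0.0

angles = list(range(0, 391, 30))           # 0, 30, ..., 390 degrees
table = {a: sine_of_degrees(a) for a in angles}

for angle, value in table.items():
    print(f"{angle:>4}  {value:>6.3f}")
```

The value at 390 degrees repeats the value at 30 degrees, showing the period of 360 degrees that marks the function as periodic.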





The full importance in statistical work of securing a
mathematical expression for the relation between two
variables cannot be demonstrated until the subject has been
further developed. One fundamental object is the
determination of physical or economic laws underlying observed
phenomena. Another more practical object is the securing
of a formula by means of which values of one variable may
be approximated from given values of the other. Examples
throughout the book will serve to illustrate how these objects
are attained.¹

¹ A fuller discussion of different curve types is presented below, in the section
dealing with the analysis of time series.

Logarithms

Logarithms, which play such an important part in
general mathematical operations, are of equal importance in
the manipulation of the raw materials of statistics. The
nature of logarithms, and the methods by which they are
employed to facilitate arithmetic processes, may be briefly
reviewed. This discussion is concerned only with the common
system of logarithms of which the base is 10.

Any positive number may be expressed as a power of 10.
Thus

1,000 = 10 × 10 × 10 = 10^3

10,000 = 10 × 10 × 10 × 10 = 10^4.

In each case the exponent of 10 (the small number written
above and to the right) indicates the number of times the
figure 10 is repeated as a factor. For the integral powers
of 10 the exponent is a whole number, but for the other
numbers the exponent will contain a fractional value. Thus
100 is equal to 10 raised to the power 2, or 10^2; 110 is equal
to 10 raised to the power 2.04139, or 10^2.04139.

The exponent of 10, or the index of the power to which 10
must be raised to equal a certain number, is called the
logarithm of that number. The logarithm of 100 is 2, the logarithm
of 110 is 2.04139, the logarithm of 998 is 2.99913. These
figures all have reference to the base 10, though a system
of logarithms might be developed on any base. In general, if

a = b^c
then
log_b a = c

which may be read "the logarithm of a to the base b is
equal to c." The relation between the given number, the
base and the logarithm, when the common system of
logarithms is employed, may be easily remembered if the
following relations are kept in mind:

100 = 10^2
log_10 100 = 2.

The logarithm of any number has two parts, the integral
and the decimal. The whole number is called the
characteristic, and the decimal portion is termed the mantissa.
The former is determined in a given case by inspection,
while the mantissa may be obtained from logarithmic tables.
The characteristic varies with the location of the decimal
point, while the mantissa remains the same for any given
combination of numbers. This fact is illustrated by the
following figures:

log of 8,450  = 3.92686
log of 845    = 2.92686
log of 84.5   = 1.92686
log of 8.45   =  .92686
log of .845   = 9.92686 - 10
log of .0845  = 8.92686 - 10.

In finding the natural number to which a given logarithm
corresponds (such natural numbers are termed
anti-logarithms), the mantissa determines the sequence of figures,
while the whole number, or characteristic, determines the
location of the decimal point. For example, in seeking the
anti-logarithm of 2.17609 it is found that the decimal
.17609 follows the natural number 1,500, in a table of
logarithms. Since the characteristic is 2, the natural
number desired must lie between 100 and 1,000, and must
therefore be 150.
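The division of a logarithm into characteristic and mantissa may be reproduced by machine in place of a printed table. The Python sketch below is an illustration only; the function names `split_logarithm`, `written_form` and `antilog` are assumptions introduced for the example, and `math.log10` stands in for the table of logarithms.

```python
import math

def split_logarithm(number):
    """Return (characteristic, mantissa) of the common logarithm of a positive number."""
    log_value = math.log10(number)
    characteristic = math.floor(log_value)   # whole-number part, may be negative
    mantissa = log_value - characteristic    # decimal part, always in [0, 1)
    return characteristic, mantissa

def written_form(number):
    """Logarithm as conventionally written, e.g. '9.92686 - 10' for .845."""
    characteristic, mantissa = split_logarithm(number)
    if characteristic >= 0:
        return f"{characteristic + mantissa:.5f}"
    return f"{10 + characteristic + mantissa:.5f} - 10"

def antilog(characteristic, mantissa):
    """Natural number corresponding to a given characteristic and mantissa."""
    return 10 ** (characteristic + mantissa)

for n in [8450, 845, 84.5, 8.45, 0.845, 0.0845]:
    print(n, written_form(n))

print(round(antilog(2, 0.17609)))
```

Shifting the decimal point changes only the characteristic; the mantissa .92686 is common to every number in the list.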

A brief study of the following figures, showing the pro- 
gression of numbers corresponding to certain powers of 10, 
will help to fix in mind the relations between the multiples 
of 10 and their logarithms, and will enable the characteristic 
of a desired logarithm to be readily determined. 

.0001    .001    .01     .1      1      10     100    1,000   10,000

10^-4   10^-3   10^-2   10^-1   10^0   10^1   10^2    10^3    10^4

The exponents of 10 in the lower row are the logarithms 
of the numbers in the upper row. 

It should be noted that the logarithms of all numbers
from 0 to 1 are negative. Thus the logarithm of .845 is
-1 + .92686; this is written 9.92686 - 10. In covering
the range of all positive natural numbers from zero to
infinity, logarithms traverse all positive and negative values.
A negative natural number, therefore, can have neither a
positive nor a negative logarithm.

The advantage of thus expressing numbers as powers of 
10 lies in the fact that the ordinary arithmetic operations of 
multiplication, division, raising to powers and extracting 
roots are greatly facilitated by this procedure. 

To multiply numbers, add their logarithms. The sum
of the logarithms of the factors is the logarithm of their
product. In general terms:

a^x × a^y = a^(x+y).

Specifically

10^2 × 10^3 = (10 × 10) × (10 × 10 × 10) = 10^5 = 100,000
100 × 1,000 = 100,000.

To divide one number by another, subtract the
logarithm of the latter from the logarithm of the former. The
remainder is the logarithm of the desired quotient.
In general terms:

a^x ÷ a^y = a^(x-y).

Specifically

10^5 ÷ 10^2 = (10 × 10 × 10 × 10 × 10) ÷ (10 × 10) = 10^3 = 1,000
100,000 ÷ 100 = 1,000.

To raise a given number to any power, multiply the
logarithm of the number by the index of the power. The
product is the logarithm of the desired power.
In general terms:

(a^x)^y = a^(xy).

Specifically

(10^3)^2 = (10 × 10 × 10) × (10 × 10 × 10) = 10^6 = 1,000,000
1,000^2 = 1,000,000.

To extract any root of a given number, divide the
logarithm of the number by the index of the root. The
quotient is the logarithm of the desired root.
In general terms:

the y-th root of a^x = a^(x/y).

Specifically

√(10^4) = 10^(4/2) = 10^2 = 100
√10,000 = 100.

In summary:

log (a × b) = log a + log b
log (a ÷ b) = log a - log b
log a^b = b × log a
log (b-th root of a) = log a ÷ b.


These characteristic advantages of logarithms have been 
made use of in the construction of the slide rule, an instru- 
ment for reducing routine toil which should be familiar to 
all students of statistics. 
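The four rules may be tried out directly. The following Python sketch is illustrative and not part of the original text; the function names are assumptions, and `math.log10` takes the place of the logarithmic table. Each operation is carried out on the logarithms and the result converted back to a natural number.

```python
import math

def multiply_with_logs(a, b):
    """Multiply by adding logarithms."""
    return 10 ** (math.log10(a) + math.log10(b))

def divide_with_logs(a, b):
    """Divide by subtracting logarithms."""
    return 10 ** (math.log10(a) - math.log10(b))

def power_with_logs(a, index):
    """Raise to a power by multiplying the logarithm by the index."""
    return 10 ** (index * math.log10(a))

def root_with_logs(a, index):
    """Extract a root by dividing the logarithm by the index."""
    return 10 ** (math.log10(a) / index)

print(multiply_with_logs(100, 1000))
print(divide_with_logs(100000, 100))
print(power_with_logs(1000, 2))
print(root_with_logs(10000, 2))
```

The results agree with the specific examples in the text, subject to the small rounding error of floating-point arithmetic, much as a slide rule agrees with exact computation only to a limited number of figures.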


LOGARITHMIC EQUATIONS 

The graphic representation of data by means of a system 
of rectangular coordinates has been described above and 
some of the advantages of this method have been outlined.
For many purposes it is desirable to plot logarithms rather
than the natural numbers themselves. This may result in
bringing out significant relations more distinctly, or it may 
serve greatly to simplify and facilitate the manipulation of 
data. In particular, when it is possible through the use of 
logarithms to reduce a complex curve to the straight line 
form, a distinct gain has been made in the direction of 
simplicity of treatment and interpretation. 

A linear equation, it will be recalled, is of the general
form y = a + bx, where a and b are constants which
measure, respectively, the y-intercept of the given line and the
slope. The simplification of equations through the use of
logarithms involves in all cases the substitution of log x or
log y, or both, for the x or y variables, thereby reducing an
equation of a higher order to a simpler form.

This process may be illustrated with reference to the
equation y = x^2. When plotted on rectangular coordinates
this equation gives a curve of the parabolic type (cf. Fig. 5).
Reduced to logarithmic form this becomes log y = 2 log x.
This equation, in which the variables are log y and log x,
is linear in form. It is plotted in Fig. 9, for positive values
of log x. To indicate the relations involved, natural numbers
corresponding to the logarithms are given on scales to the
right and at the top of the figure. The natural numbers
appearing on the scales constitute geometric series, while
their logarithms form arithmetic series. Equal vertical
distances on the chart, it will be noted, represent equal
absolute increments on the scale of logarithms and equal
percentage increments on the scale of natural numbers.

Fig. 9. Graph of the Equation log y = 2 log x (Logarithmic form of
the equation y = x^2)

The equation y = 5x^3 can be reduced in the same way
to log y = log 5 + 3 log x, a linear form. Similarly, all
equations of the type y = ax^b, that is to say, all simple
parabolas and hyperbolas, can be reduced to the straight
line form log y = log a + b log x. Graphically this means
plotting the logarithms of the y's against the logarithms of
the x's.

A different problem is presented by an equation of the
type y = ab^x, the graph of which is termed an exponential
curve. Expressed in logarithmic form, we have log y =
log a + x log b. This is also of the linear type, the two
constants being log a and log b, while the variables are x
and log y. If we plot the natural x's and the logs of
the y's, with this type of equation, a straight line will be
secured. A curve of this type is discussed and illustrated
below.
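Both reductions to the straight line form can be checked numerically. The sketch below is an illustration under stated assumptions: the helper `line_through` and the sample x values are invented for the example, and the exponential constants a = 3 and b = 2 are arbitrary. It transforms points from each curve and recovers the constants from the slope and intercept of the resulting line.

```python
import math

def line_through(points):
    """Slope and intercept of the straight line through the first and last point."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    slope = (y1 - y0) / (x1 - x0)
    return slope, y0 - slope * x0

# Power curve y = 5x^3: plot log y against log x.
power_points = [(math.log10(x), math.log10(5 * x ** 3)) for x in [1, 2, 4, 8]]
b_power, log_a_power = line_through(power_points)

# Exponential curve y = 3 * 2^x (assumed constants): plot log y against x.
exp_points = [(x, math.log10(3 * 2 ** x)) for x in [0, 1, 2, 3]]
log_b_exp, log_a_exp = line_through(exp_points)

print(b_power, 10 ** log_a_power)        # slope b and constant a of y = a x^b
print(10 ** log_b_exp, 10 ** log_a_exp)  # constants b and a of y = a b^x
```

In the first case the slope of the line is the exponent b and the intercept is log a; in the second the slope is log b and the intercept again log a.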

LOGARITHMIC AND SEMI-LOGARITHMIC CHARTS

There are certain disadvantages to the plotting of
logarithms, however. If a considerable number of points are
being plotted the task of looking up the logarithms may be
tedious, and, in addition, the original values, in which
chief interest lies, will not appear on the chart. These
difficulties may be avoided by constructing charts with the
scales laid off logarithmically, but with the natural numbers
instead of the logarithms appearing on the scales. This is
an arrangement identical with that employed in the
construction of slide rules. Thus, although the natural numbers
are given on the scales, distances are proportional to the
logarithms of the numbers thereon plotted. In Fig. 10
such a chart is presented, showing the graph of the
equation y = x^2.

Fig. 10. Graph of the Equation y = x^2 (Plotted on paper with
logarithmic scales)

A variation of this type of chart which is of great
importance in statistical work is one which is scaled
arithmetically on the horizontal axis and logarithmically on the
vertical axis. This is equivalent, of course, to plotting the
x's on the natural scale and plotting the logarithms of the
y's. As was pointed out above, such a combination of
scales reduces a curve of the exponential type to a straight
line. Plotting paper of this semi-logarithmic or "ratio"
type may be constructed with the aid of a slide rule or
of logarithms, or may be purchased ready made. It is of
particular value in charting economic statistics, because of
the fact that time is usually one of the variables in such
cases, and it is desirable to plot this variable on the natural
scale.


As an example of this type of curve the compound
interest law may be used. If r be taken to represent the rate
of interest, x the number of years, p the principal, and y
the sum to which the principal amounts at the end of x
years (interest being compounded annually), an equation is
secured of the form

y = p(1 + r)^x.

Expressed logarithmically this becomes

log y = log p + x log (1 + r),

the equation to a straight line.

In Fig. 11 a curve representing the growth of ten dollars
at compound interest at 6 per cent is plotted on the natural
scale. This is the graph of the exponential equation

y = 10(1 + .06)^x,

y representing the total amount of principal and interest
at the end of x years. Figure 12 shows the same data plotted
on semi-logarithmic paper, the exponential curve being
reduced to a straight line.

Fig. 11. The Compound Interest Law: Growth of $10.00 at Compound
Interest at 6 per cent for 100 Years (Plotted on arithmetic scale)

Fig. 12. The Compound Interest Law: Growth of $10.00 at Compound
Interest at 6 per cent for 100 Years (Plotted on semi-logarithmic
or ratio scale)

The use of semi-logarithmic paper is not confined to
cases in which an exponential curve is straightened out,
for the significance of many types of data is most effectively
brought out when charts of this type are used. These
advantages are more fully explained below.
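The compound interest example lends itself to a short computation. The Python sketch below is an illustration only; the function name `compound_amount` and the choice of 25-year intervals are assumptions. It computes the growth of ten dollars at 6 per cent and shows that the logarithms of the amounts rise by equal steps.

```python
import math

def compound_amount(principal, rate, years):
    """Amount after annual compounding: y = p * (1 + r)**x."""
    return principal * (1 + rate) ** years

years = [0, 25, 50, 75, 100]     # assumed sampling interval of 25 years
amounts = [compound_amount(10, 0.06, x) for x in years]
logs = [math.log10(y) for y in amounts]

# Equal intervals of time give equal steps in log y: a straight line
# on the semi-logarithmic chart, with slope log(1 + r) per year.
log_steps = [b - a for a, b in zip(logs, logs[1:])]

for x, y, log_y in zip(years, amounts, logs):
    print(f"{x:>3}  {y:>10.2f}  {log_y:.5f}")
```

The amounts themselves rise by ever larger steps, which gives the steep curve on the arithmetic scale, while each step in log y equals 25 × log 1.06.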

The Construction of Charts

When the results of observations or statistical
investigations have been secured in quantitative form, one of the
first steps toward analysis and interpretation of the data
is that of presenting these results graphically. Not only is
such procedure of scientific value in paving the way for
further investigation of relationships, but it serves an
immediate practical purpose in visualizing the results. A visual
stimulus opens up a far more direct path to our
understanding and imagination than that afforded by the more recently
developed processes of reasoning. The interpretation of a
column of raw figures may be a difficult task; the same data
in graphic form may tell a simple and easily understood story.
For these reasons graphic methods of presentation have come
to play a highly important part in the everyday activities
of business, as well as in the laboratory and drafting room.

It is beyond the scope of this book to present any detailed 
account of the multiplicity of graphs employed by engineers
and statisticians today. Certain of the more important 
principles of graphic presentation may be briefly explained, 
however, and some of the chief types of graphs which are 
in daily use may be illustrated. Other examples appear in 
later chapters of this book. 

FACTORS GOVERNING THE SELECTION OF A CHART 

The selection of the type of chart to be employed in a
given case will depend upon two general considerations.
The first of these relates to the character of the material
to be plotted. While the data of a given problem may
frequently be presented graphically in several different
forms, there is generally one type of chart best adapted
to that material. It may be true, also, that certain types
would be quite inappropriate to the data in question. The
selection of a type of chart to employ, therefore, must
be made with the characteristics of the data clearly in
mind.

Perhaps more important is the purpose which the given
chart is designed to serve. Each of the many types
of charts in common use is appropriate to certain
specific purposes. It will bring out certain characteristics of
the data or will emphasize certain relationships. There
is no chart which is sovereign for all purposes. Until the
purpose is clearly defined the best chart form cannot
be selected. The following descriptions of a few
standard types will facilitate the selection of an appropriate
form.

CHARTS ADAPTED TO THE PLOTTING OF TIME SERIES

In the graphic presentation of a time series, primary
interest attaches to the chronological variations in the values
of the data, to the general trend and to the fluctuations
about the trend. If the purpose is to emphasize the absolute
variations, the differences in absolute units between the
values of the series at different times, a simple chart of the
type illustrated in Fig. 13 will serve the purpose. This
chart depicts annual wheat flour exports from the United
States during the period 1913-1936. Both scales are
arithmetic. Points representing the various annual values are
shown and, to facilitate interpretation, these points are
connected by a series of straight lines. The chart tells a
simple story of year-to-year fluctuations, with a sharp
advance at the end of the World War, a decline as the post-war
emergency passed, several years of moderate growth, and a
severe decline as the world depression deepened in the early
thirties.

Fig. 13. Wheat Flour Exports from the United States, 1913-1936

With respect to general make-up, the following
points should be noted:

1. The title constitutes a clear description of the material plotted
   and indicates the period covered.

2. The vertical scale begins at the zero line, enabling a true
   impression to be gained of the magnitude of the fluctuations.

3. The zero line and the line joining the plotted points are ruled
   more heavily than the coordinate lines.

4. Figures for the scales are placed at the left and at the bottom
   of the chart. The vertical scale may be repeated at the
   right to facilitate reading. All figures are so placed that
   they may be read from the base as bottom or from the
   right hand edge of the chart as bottom.

5. The y-values of the plotted points are given at the top of the
   chart. This practice is helpful, though not necessary, as
   the values may be presented in a separate table.

ADVANTAGES OF THE RATIO CHART

If relative rather than absolute variations are of chief
concern, the chart employed should be of the
semi-logarithmic type, scaled logarithmically on the y-axis and
arithmetically on the x-axis. In such a chart equal percentage
variations are represented by equal vertical distances, as
opposed to the ordinary arithmetic type in which equal
absolute variations are represented by equal vertical
distances. The argument for the use of the semi-logarithmic
or ratio chart for the representation of time series is that,
in general, the significance of a given change depends upon
the magnitude of the base from which the change is
measured. That is, an increase of 100 on a base of 100 is as
significant as an increase of 10,000 on a base of 10,000. In
each case there is an increase of 100 per cent. The absolute
increase in the second case is 100 times that in the first case,
and the two changes would show in this proportion on the
arithmetic chart. They would show as of equal importance
on the semi-logarithmic chart.

Fig. 14. Production of Steel Ingots and Castings in the United States,
1896-1936 (Plotted on semi-logarithmic scale)
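The contrast between the two rulings can be stated as a pair of distance calculations. The sketch below is illustrative and not part of the original text; the function names are assumptions, and distances are expressed in the units of each scale rather than in inches on any actual chart.

```python
import math

def arithmetic_distance(start, end):
    """Vertical distance between two values on an arithmetic scale."""
    return end - start

def ratio_distance(start, end):
    """Vertical distance between two values on a logarithmic (ratio) scale."""
    return math.log10(end) - math.log10(start)

small = (100, 200)         # increase of 100 on a base of 100
large = (10000, 20000)     # increase of 10,000 on a base of 10,000

print(arithmetic_distance(*small), arithmetic_distance(*large))
print(ratio_distance(*small), ratio_distance(*large))
```

On the arithmetic scale the second change is shown 100 times as large as the first; on the ratio scale both occupy the same vertical distance, log 2, because each is an increase of 100 per cent.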

Such a chart is presented in Fig. 14, which shows the
course of steel production in the United States from 1896
to 1936. The absolute magnitudes are plotted, but the
vertical scale is so constructed as to represent variations
from year to year in proportion to their relative magnitude.

Fig. 15. Exports of the United States, 1920-1936, Showing Total
Exports and Exports to Selected Areas (Monthly averages for the years
named are plotted on an arithmetic scale)

Certain distinctive advantages of the ratio or logarithmic
ruling are brought out by a comparison of Fig. 15 and
Fig. 16. The data presented graphically in these two charts
are shown in Table 2:

Table 2

Exports of the United States, 1920-1936
(Monthly averages, in thousands of dollars)

          To         To         To      To United   Total to     Grand
Year    France    Germany     Italy      Kingdom     Europe      total

1920   $56,349    $25,953   $30,980     $161,319    $372,174   $685,668
1921    18,745     31,027    17,955       78,510     196,992    373,753
1922    22,247     26,343    12,575       71,319     173,613    319,315
1923    22,678     26,403    13,961       73,527     174,451    347,291
1924    23,472     36,702    15,595       81,912     203,775    382,582
1925    23,358     39,195    17,096       86,155     216,979    409,154
1926    22,000     30,347    13,117       81,051     192,512    400,722
1927    19,065     40,140    10,971       70,005     192,576    405,448
1928    20,058     38,938    13,510       70,613     197,912    427,363
1929    22,133     34,204    12,831       70,667     195,070    435,083
1930    18,663     23,189     8,369       56,509     153,198    320,265
1931    10,152     13,838     4,568       37,923      95,040    202,024
1932     9,297     11,139     4,095       24,027      65,358    134,251
1933    10,143     11,669     5,103       25,978      70,815    139,583
1934     9,642      9,062     5,381       31,896      79,150    177,733
1935     9,751      7,665     6,035       36,117      85,770    190,240
1936    10,795      8,382     4,900       36,662      86,694    204,457

(Data compiled by Bureau of Foreign and Domestic Commerce, U. S.
Department of Commerce.)


If the six series are to be presented on a single chart,
scaled arithmetically, a scale must be selected which will
include the largest item recorded, $685,668,000. Such a
scale reduces the relative importance of the smaller
magnitudes. From Fig. 15 it appears that during the period
covered by the chart very large fluctuations occurred in total
exports, substantial but somewhat smaller movements
occurred in exports to Europe, and that exports to the four
individual countries suffered much less severe fluctuations.
Such a picture is quite misleading. The true state of affairs
is reflected in Fig. 16, in which the same data are plotted
on paper with a semi-logarithmic ruling. Fluctuations in
exports to the individual countries are here seen to have
been relatively greater than the movements of total exports.
For the purpose of comparing series which differ materially
with respect to the magnitude of the individual items, the
arithmetic ruling is quite useless, giving a thoroughly
distorted picture of the true relations. The ratio ruling permits
a legitimate comparison.

The scales printed below Fig. 16 emphasize certain very
useful features of the logarithmic ruling. The scale of
increase may be used to measure with a fair degree of
accuracy the increase in a given series between any two dates.
A given vertical distance on the chart, it will be recalled,
represents a constant percentage increase at all points on 
the chart. Thus the distance from 1 to 10, along the vertical 
scale, is the same as the distance from 100 to 1,000. Any 
vertical distance may be measured, and the percentage of 
increase which it represents may be determined by laying 
off the given distance along the scale of increase, which is 
always read from the bottom up. For example, to determine 
the degree of increase in total exports from 1932 to 1935, 
we measure the vertical distance between the points plotted 
for these two years. Laying off this distance along the scale, 
it is found to represent about a 40 per cent increase. 

The scale of decrease is used in a similar fashion. The 
vertical distance between any two points is measured, and 
the percentage decrease which it represents is determined 
by laying off the given distance on the scale from the top 
downward. The arrows indicate the direction in which the 
various scales are to be read. 

By means of the scale of comparison the percentage relation 
of one series to another at any time may be determined. 
For example, we may wish to know the percentage relation 
between exports to Europe and total exports in 1935. The 
vertical distance between the two plotted points is measured, 
and laid off on the scale of comparison, reading from the top 
downward. It is found to be approximately 45 per cent. 
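The two readings taken from the scales may be compared with direct calculation from the figures of Table 2. The Python sketch below is an illustration only; the function and variable names are assumptions, while the export figures (monthly averages in thousands of dollars) are copied from the table.

```python
import math

# Monthly averages in thousands of dollars, copied from Table 2.
total_exports = {1932: 134_251, 1935: 190_240}
europe_exports_1935 = 85_770

def percent_increase(start, end):
    """Percentage by which `end` exceeds `start`."""
    return (end / start - 1) * 100

def percent_relation(part, whole):
    """`part` expressed as a percentage of `whole`."""
    return part / whole * 100

def vertical_distance(start, end):
    """Distance between two values on a logarithmic scale, in log10 units."""
    return math.log10(end) - math.log10(start)

def increase_from_distance(distance):
    """Percentage increase represented by a vertical distance in log10 units."""
    return (10 ** distance - 1) * 100

growth = percent_increase(total_exports[1932], total_exports[1935])
share = percent_relation(europe_exports_1935, total_exports[1935])
distance = vertical_distance(total_exports[1932], total_exports[1935])

print(f"{growth:.1f}  {share:.1f}  {increase_from_distance(distance):.1f}")
```

The computed increase falls a little above 40 per cent and the relation of European to total exports very near 45 per cent, in keeping with the approximate readings from the chart. Converting the vertical distance back to a percentage returns the same increase, which is the principle on which the scale of increase rests.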

Scales of the type illustrated above may be readily
constructed on a given chart by using the ratio ruling for the
scale intervals. When a series of charts is prepared on
semi-logarithmic paper of a standard type it is convenient
to construct such scales in a more permanent form, in the
shape of special rulers.

A ratio chart is particularly useful when interest attaches
to rates of growth (or decline) over a considerable period of
time. In such a case, the reading of the chart is facilitated
by the plotting of straight diagonal lines indicating uniform
rates of change. These should radiate from a single point of
origin. The procedure is illustrated in Fig. 17. The diagonal
lines there shown indicate changes at uniform rates ranging
from 10 per cent to 50 per cent per year. By reference to
these lines the user of the chart may readily determine the
approximate rate of growth of the plotted series between
any two years.

Fig. 17. Production of Rayon Filament Yarn in the United States,
1912-1937, With Lines Defining Uniform Rates of Growth

The chief advantages of the semi-logarithmic ruling in 
chart construction may be briefly summarized : 

1. A curve of the exponential type becomes a straight line when 
   plotted on a semi-logarithmic chart. For example, a curve 
   representing the growth of any sum of money at compound 
   interest takes the form of a straight line when so plotted. 

2. In any series, so long as the rate of increase or decrease remains 
   constant the graph will be a straight line on this ruling. 

3. Equal relative changes are represented by lines having equal 
   slopes. Thus two series increasing or decreasing at equal 
   rates will be represented by parallel lines. 

4. Comparison of the rates of change in two or more series is 
   effected by comparison of the slopes of the plotted lines. 

5. The semi-logarithmic ruling permits, at the same time, the 
   plotting of absolute magnitudes and the comparison of 
   relative changes. 

6. Comparison of series differing materially in the magnitude of 
   individual items is possible with the semi-logarithmic chart. 

7. Percentages of change may be read and percentage relations 
   between magnitudes determined directly from the chart. 
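The first three points in the list above may be checked numerically. The following is a minimal sketch in Python, not part of the original text; the function names and the 6 per cent rate are assumptions chosen for illustration. It shows that a sum growing at compound interest rises by the same vertical distance each year on a logarithmic scale, and that two sums of very different size growing at the same rate give equal slopes.

```python
import math

def compound_series(principal, rate, years):
    """Value of a sum at compound interest at the end of each year."""
    return [principal * (1 + rate) ** t for t in range(years + 1)]

def log_steps(series):
    """Year-to-year vertical distances on a logarithmic (ratio) scale."""
    logs = [math.log10(v) for v in series]
    return [b - a for a, b in zip(logs, logs[1:])]

def percent_change(log_distance):
    """Percentage change represented by a vertical distance on the log scale."""
    return (10 ** log_distance - 1) * 100

# Two hypothetical sums of different magnitude, both growing at 6 per cent.
small = log_steps(compound_series(100.0, 0.06, 10))
large = log_steps(compound_series(5000.0, 0.06, 10))

print(small[0], large[0])        # equal steps: the plotted lines are parallel
print(percent_change(small[0]))  # the step read back as a percentage
```

Because every yearly step is the same, the plotted points fall on a straight line, and the slope alone identifies the rate of change.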


CHARTS FOR THE COMPARISON OF FREQUENCIES 

A different type of chart is called for when the object is 
the comparison of frequencies, that is, numbers of events 
or things of different classes. The following census figures 
may serve to illustrate the problem. 


Table 3 

Farms in New England States in 1935 

State               Number of farms 
Maine                    41,907 
New Hampshire            17,695 
Vermont                  27,061 
Massachusetts            35,094 
Rhode Island              4,327 
Connecticut              32,167 


A graphic comparison of these six states with respect to 
number of farms in 1935 is afforded by the bar diagram in 
Fig. 18. This is a simple but effective type of chart for this 
purpose. 

Fig. 18. Farms in New England States in 1935 

Further examples of this type of chart, as employed in 
the representation of frequency distributions, are contained 
in the next chapter. It is there shown how a frequency 
polygon or frequency curve may grow out of the simple 
bar diagram, when data of certain kinds are being handled. 
Such frequency curves constitute very important graphic 
types, but it will be more appropriate to treat them in 
full at a later point. 

CHARTS FOR THE REPRESENTATION OF COMPONENT PARTS 

It is frequently desirable in tabular and graphic presenta- 
tion to break up a total into its component parts, in order 
that changes in the parts as well as in the total may be fol- 
lowed. The table on page 43 exemplifies this procedure. 

These figures are presented graphically in Fig. 19, which 
reveals the varying post-war fortunes of different interests 
in American manufacturing industries. It is clear from the 
diagram that the general swings of material costs, labor costs 
and overhead costs in American manufacturing industries 
have paralleled the fluctuations in total value of products. 
Some of the movements of the component items are of 
exceptional interest, however. Overhead costs (with which 
profits are here combined) showed an expansion between 
1921 and 1929. The great recession that followed squeezed 
all the elements of the total, forcing them to levels well 
below those of the 1921 depression. 

Table 4 

Total Value of Products and Elements of Production Costs, 
Manufacturing Industries of the United States, 1919-1935 
(Millions of dollars) 

Year    Cost of        Labor cost    Overhead cost     Total value 
        materials ¹    (wages)       plus profits ²    of products 
1919    $37,233        $10,462       $14,347           $62,042 
1921     25,321          8,202        10,130            43,653 
1923     34,706         11,009        14,841            60,556 
1925     35,936         10,730        16,048            62,714 
1927     34,803         10,836        16,639            62,278 
1929     38,178         11,607        20,176            69,961 
1931     21,681          7,173        12,184            41,038 
1933     16,821          5,262         9,276            31,359 
1935     26,264          7,545        11,951            45,760 

¹ Including containers, fuel and purchased electric energy. 

² This item represents the difference between total direct costs (materials 
and wages) and total value of products. It includes overhead costs proper, 
plus salaries, taxes, profits, etc. 

Fig. 19. Total Value of Products and Elements of Production Costs, 
Manufacturing Industries of the United States, 1919-1935 


CUMULATIVE CHARTS 

In many cases chief interest in the development of a 
series attaches not to the value of each successive item but 
to the cumulated total of a number of such items. This 
may be so when a yearly production program has been 
laid out. In such a case it is the relation between cumulated 
production to date and scheduled production to date 
which is of major interest, and a chart form is needed 
which will enable this comparison to be made. The 
following figures illustrate the type of data for which such 
charts are appropriate. 


Table 5 

Cumulative Production Schedule and Cumulative Output, 1936 
(Speedwell Automobile Company) 

             Production    Cumulated                   Cumulated 
Month        schedule      production     Output       output 
             (cars)        schedule       (cars)       (cars) 
                           (cars) 
January        8,000         8,000         6,125         6,125 
February      10,000        18,000         9,250        15,375 
March         12,000        30,000        10,514        25,889 
April         15,000        45,000        15,131        41,020 
May           14,000        59,000        12,159        53,179 
June          12,000        71,000        13,250        66,429 
July          11,000        82,000        11,462        77,891 
August        10,000        92,000        10,531        88,422 
September      6,000        98,000         4,621        93,043 
October        9,000       107,000         9,843       102,886 
November      10,000       117,000        13,785       116,671 
December      10,000       127,000 
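The cumulated output column is simply a running total of the monthly output, and the amount by which production lags the schedule is the difference between the two cumulated columns. A minimal sketch in Python (not part of the original text; the variable names are illustrative assumptions) using the cumulated schedule and the monthly output figures of Table 5 through November:

```python
from itertools import accumulate

months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November"]

# Cumulated production schedule (cars), January through November, from Table 5.
cumulated_schedule = [8000, 18000, 30000, 45000, 59000, 71000,
                      82000, 92000, 98000, 107000, 117000]

# Monthly output (cars), January through November, from Table 5.
monthly_output = [6125, 9250, 10514, 15131, 12159, 13250,
                  11462, 10531, 4621, 9843, 13785]

# Running total of output, month by month.
cumulated_output = list(accumulate(monthly_output))

# Cars by which cumulated output falls short of the cumulated schedule.
shortfall = [s - o for s, o in zip(cumulated_schedule, cumulated_output)]

for month, done, behind in zip(months, cumulated_output, shortfall):
    print(f"{month:<10} {done:>8,} {behind:>7,}")
```

The last line of the printout gives the position at the end of November, the date to which the table refers.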


It is assumed that this table represents the situation as 
of the end of November. 

In Fig. 20 the two cumulative curves are plotted. The 
relation between actual and scheduled production at the 
end of each month is shown on the chart, and it is possible 
from the scale to read the approximate amount by which 
production is behind schedule. By reference to the figures, 
which should always accompany the chart, the exact 
relation may be determined. Such a chart has many 
applications, some of which are illustrated in the following 
chapter. 

Fig. 20. Comparison of Scheduled and Actual Output (Cumulative), 
Speedwell Automobile Co., 1936 

THE GANTT PROGRESS CHART 

The same data may be presented in a very effective form 
by making use of a type of chart developed by Mr. H. L. 
Gantt. An adequate description of this chart and of its 
many uses would far exceed the space which can be given 
to it here, but its characteristics may be indicated in a very 
brief account. 

Once a schedule has been drawn up, the Gantt chart 
may be utilized in checking actual accomplishment against 
the schedule. Having such a schedule as that given in 
Table 5, the monthly and annual quotas may be entered 
on a form similar to that shown in Fig. 21. The entry to 
the left of each monthly space indicates the amount 
scheduled for production during that month. The entry to the 
right of each monthly space indicates the cumulated 
scheduled production to the end of the given month. In this 
figure the results of the first two months' operations are 
shown. The heavy black line indicates the cumulated 
actual production during this period, amounting to 15,375 
cars. The narrow upper lines in the January and February 
spaces indicate the actual production in each of those 
months. If actual production in either month had equaled 
the scheduled production the light line would extend across 
the full monthly space. When actual production in a given 
month exceeds scheduled production a double light line 
[...] period is divided represent equal time intervals but 
varying amounts in terms of actual production. Thus the 
[...] (the January quota being 8,000) [...] 

Fig. 22. Comparison of Scheduled and Actual Output, 1936: Gantt 
Progress Chart (Showing the situation November 30th) 

The situation at the end of November is shown in Fig. 22. 
The arrow at the top of the diagram indicates the point of 
time actually reached. That actual production is slightly 
behind scheduled production is apparent from the relation 
between this arrow and the heavy black line, while the light 
lines indicating monthly production show that actual output 
[...] 

[...] graphics, is another report of the same committee on charts 
suitable for use as lantern slides. 

CHAPTER III 

THE ORGANIZATION OF STATISTICAL DATA: 
THE FREQUENCY DISTRIBUTION 

The task of the statistician engaged in business or 
economic research includes the organization, analysis and 
interpretation of quantitative data relating to business affairs 
and to economic conditions. To these fundamental opera- 
tions that of collecting the original data may be added, 
though more frequently data will be compiled directly from 
primary or secondary sources. 

At the outset it is necessary to distinguish between the 
problems arising in the analysis of time series and those 
involved in the organization and analysis of materials in 
connection with which the time factor does not enter. In 
studying a time series the primary object is to measure and 
analyze the chronological variations in the value of the 
variable. Thus one may study variations in sales over a 
period of years, fluctuations in the production of bituminous 
coal, or changes in the general level of prices. Quite 
different is the procedure in the study of such a problem as 
income distribution at a given time. In this case we are 
desirous of knowing how many people in the United States 
fall in each of a number of income classes. The general 
problem of organization in this latter class of cases is to 
determine how many times each value of a variable is 
repeated and how these values are distributed. Data of this 
sort, when organized, constitute a frequency series, as opposed 
to the time or historical series. The methods appropriate 
to these two types of analysis differ fundamentally and will 
therefore be treated separately. In the present section we 
are concerned with the organization and preliminary 
analysis of data in connection with which the time element, 
while it may be present, does not enter as a factor. 

Unorganized Data 

When quantitative data of the type with which the 
statistician works are presented in a raw state they appear 
as unorganized masses of material, without form or struc- 
ture. They may have been drawn from the production or 
sales records of a business establishment, or they may 
represent a miscellaneous collection of price quotations. If 
the data have been gathered by other agencies they may 
already have been arranged in the form of a general table, 
but this form may be entirely unsuited to the particular 
object in the mind of the investigator. The first task of 
the statistician is the organization of the figures in such 
a form that their significance, for the purpose in hand, may 
be appreciated, that comparison with masses of similar 
data may be facilitated, and that further analysis may be 
possible. Scientific method, it has been noted, involves 
observation, inference, and verification. Data, the results of 
observation, must be put into definite form and given 
coherent structure before the process of inference is possi- 
ble. 

The figures on page 52, representing the earnings during 
a given week of 210 individuals engaged in piece work in a 
certain manufacturing establishment, will serve as an ex- 
ample of such data in their raw state. 

The Array 

If these figures are arranged in order of magnitude some- 
thing will have been done toward securing a coherent 
structure. The range covered and the general distribution 
throughout this range will then be clear, and the way will 
be prepared for further organization. When so arranged 
the array on page 53 is secured. 

Weekly Earnings of 210 Employees 

$26.25  $28.70  $24.15  $29.75  $29.20  $30.60  $23.40  $24.75 
 26.70   24.35   25.75   27.20   28.30   25.25   27.75   27.60 
 28.20   27.30   27.80   26.35   27.40   28.30   26.00   25.75 
 27.70   28.60   25.30   27.80   26.40   27.30   28.35   27.00 
 24.30   27.80   27.60   26.30   27.40   23.50   26.60   27.80 
 27.60   25.35   27.55   29.00   24.10   27.00   24.50   27.25 
 26.15   29.30   23.10   27.10   28.50   27.45   26.15   28.35 
 27.95   25.55   27.55   26.60   24.25   30.00   28.55   28.00 
 27.30   27.90   25.25   24.10   27.45   24.55   26.55   27.55 
 26.75   31.00   24.00   25.35   26.50   28.30   27.95   25.55 
 30.25   28.55   26.75   24.60   25.75   26.55   27.80   28.90 
 29.55   30.00   24.60   25.75   26.30   27.00   28.25   25.25 
 25.75   26.25   26.30   26.75   27.90   28.30   25.70   26.30 
 26.60   27.00   30.75   28.60   28.10   23.50   24.75   25.15 
 26.30   27.25   28.15   29.10   30.10   29.90   28.55   27.30 
 26.55   27.55   23.00   24.50   22.85   26.55   27.55   28.10 
 30.70   28.60   27.90   26.80   24.10   25.25   26.30   27.90 
 26.90   25.30   25.80   28.85   27.55   27.30   25.00   26.00 
 26.55   27.80   28.60   30.55   29.50   24.10   25.15   27.15 
 28.10   26.30   27.10   24.60   27.80   26.30   27.90   29.80 
 24.10   25.15   27.50   24.25   25.70   26.80   30.15   29.30 
 28.15   28.65   24.55   25.85   26.10   27.00   26.80   27.55 
 29.00   23.00   28.60   29.30   28.55   28.80   27.55   23.60 
 26.10   27.15   26.75   26.80   27.15   26.30   28.55   25.80 
 24.55   25.80   26.75   27.30   27.55   28.25   25.60   26.30 
 26.85   27.30   28.10   32.00   28.15   26.30   27.75   26.25 
 28.60   26.00 


Frequency Tables 

[...] more suitable for study than [...] first shown, there is 
still [...] haphazard distribution [...] mind can readily [...] 
The factory manager [...] earned during the week was $22.85 
[...] largest amount earned was $32.00 and that [...] 
employees earned between $25.00 and $29.00 [...] of the data 
[...] that is, by putting into common [...] earnings fall 
within certain limits a 

simplified and more compact 

Array: Weekly Earnings of 210 Employees 

$22.85  $25.15  $26.15  $26.75  $27.45  $27.95  $28.60 
 23.00   25.15   26.15   26.75   27.45   27.95   28.65 
 23.00   25.15   26.25   26.80   27.50   28.00   28.70 
 23.10   25.25   26.25   26.80   27.55   28.10   28.80 
 23.40   25.25   26.25   26.80   27.55   28.10   28.85 
 23.50   25.25   26.30   26.80   27.55   28.10   28.90 
 23.50   25.25   26.30   26.85   27.55   28.10   29.00 
 23.60   25.30   26.30   26.90   27.55   28.15   29.00 
 24.00   25.30   26.30   27.00   27.55   28.15   29.10 
 24.10   25.35   26.30   27.00   27.55   28.15   29.20 
 24.10   25.35   26.30   27.00   27.55   28.20   29.30 
 24.10   25.55   26.30   27.00   27.55   28.25   29.30 
 24.10   25.55   26.30   27.00   27.60   28.25   29.30 
 24.10   25.60   26.30   27.10   27.60   28.30   29.50 
 24.15   25.70   26.30   27.10   27.60   28.30   29.55 
 24.25   25.70   26.30   27.15   27.70   28.30   29.60 
 24.25   25.75   26.35   27.15   27.75   28.30   29.75 
 24.30   25.75   26.40   27.15   27.75   28.35   29.80 
 24.35   25.75   26.50   27.20   27.80   28.35   29.90 
 24.50   25.75   26.55   27.25   27.80   28.50   30.00 
 24.50   25.75   26.55   27.25   27.80   28.55   30.00 
 24.55   25.75   26.55   27.30   27.80   28.55   30.10 
 24.55   25.80   26.55   27.30   27.80   28.55   30.15 
 24.55   25.80   26.55   27.30   27.80   28.55   30.25 
 24.60   25.80   26.60   27.30   27.80   28.55   30.55 
 24.60   25.85   26.60   27.30   27.90   28.60   30.60 
 24.60   26.00   26.60   27.30   27.90   28.60   30.70 
 24.75   26.00   26.70   27.30   27.90   28.60   30.75 
 24.75   26.10   26.75   27.40   27.90   28.60   31.00 
 25.00   26.10   26.75   27.40   27.90   28.60   32.00 


[...] may be obtained. The following table shows the results 
of this grouping process [...] class-interval is two dollars. 
This table presents a condensed summary of the original 
[...] approximate [...] shows, also, how the earnings of [...] 
distributed throughout this range. There has been a 
considerable loss of detail, it will be [...] 

Table 6 

Frequency Distribution of Employees 
(Classified on the basis of weekly earnings. Class-interval = $2) 

Weekly earnings        Number earning stated amount 
                       (frequency) 
$22.00 to $23.99                 8 
 24.00 to  25.99                48 
 26.00 to  27.99                96 
 28.00 to  29.99                47 
 30.00 to  31.99                10 
 32.00 to  33.99                 1 
                               --- 
                               210 

From this table we may learn that there are 48 persons who 
earned during the given week between $24.00 and $25.99, 
but we cannot learn how the earnings of the 48 individuals 
were distributed throughout this range of two dollars. All 
may have earned exactly $24.00, so far as we may know 
from the figures shown in the table. This loss of detail is 
an inevitable accompaniment of the condensation and 
simplification which the process of classification involves. 

If the size of the class-interval be decreased the loss of 
detail is less pronounced, though the increase in the number 
of classes means a more cumbersome table and one which 
presents a more complex picture to the eye. The tables 
on page 55 present the same data, classified with intervals 
of one dollar, fifty cents, and twenty-five cents. 

Frequency Distributions of Employees 
(Classified on the basis of weekly earnings) 

Table 7 (Class-interval = $1) 

Weekly earnings        Frequency 
$22.00 to $22.99            1 
 23.00 to  23.99            7 
 24.00 to  24.99           21 
 25.00 to  25.99           27 
 26.00 to  26.99           42 
 27.00 to  27.99           54 
 28.00 to  28.99           34 
 29.00 to  29.99           13 
 30.00 to  30.99            9 
 31.00 to  31.99            1 
 32.00 to  32.99            1 
                          --- 
                          210 

Table 8 (Class-interval = 50 cents) 

Weekly earnings        Frequency 
$22.50 to $22.99            1 
 23.00 to  23.49            4 
 23.50 to  23.99            3 
 24.00 to  24.49           11 
 24.50 to  24.99           10 
 25.00 to  25.49           12 
 25.50 to  25.99           15 
 26.00 to  26.49           22 
 26.50 to  26.99           20 
 27.00 to  27.49           24 
 27.50 to  27.99           30 
 28.00 to  28.49           17 
 28.50 to  28.99           17 
 29.00 to  29.49            7 
 29.50 to  29.99            6 
 30.00 to  30.49            5 
 30.50 to  30.99            4 
 31.00 to  31.49            1 
 31.50 to  31.99            0 
 32.00 to  32.49            1 
                          --- 
                          210 

Table 9 (Class-interval = 25 cents) 

Weekly earnings        Frequency 
$22.75 to $22.99            1 
 23.00 to  23.24            3 
 23.25 to  23.49            1 
 23.50 to  23.74            3 
 23.75 to  23.99            0 
 24.00 to  24.24            7 
 24.25 to  24.49            4 
 24.50 to  24.74            8 
 24.75 to  24.99            2 
 25.00 to  25.24            4 
 25.25 to  25.49            8 
 25.50 to  25.74            5 
 25.75 to  25.99           10 
 26.00 to  26.24            6 
 26.25 to  26.49           16 
 26.50 to  26.74           10 
 26.75 to  26.99           10 
 27.00 to  27.24           11 
 27.25 to  27.49           13 
 27.50 to  27.74           14 
 27.75 to  27.99           16 
 28.00 to  28.24            9 
 28.25 to  28.49            8 
 28.50 to  28.74           14 
 28.75 to  28.99            3 
 29.00 to  29.24            4 
 29.25 to  29.49            3 
 29.50 to  29.74            3 
 29.75 to  29.99            3 
 30.00 to  30.24            4 
 30.25 to  30.49            1 
 30.50 to  30.74            3 
 30.75 to  30.99            1 
 31.00 to  31.24            1 
 31.25 to  31.49            0 
 31.50 to  31.74            0 
 31.75 to  31.99            0 
 32.00 to  32.24            1 
                          --- 
                          210 

The four tables we have thus constructed represent four 
different degrees of condensation of the same data. Tables 6, 
7, and 8 present the same general characteristics: a small 
number of cases in the extreme classes and a more or less 
regular increase in the frequencies as the center of each of 
the distributions is approached. The departure from 
regularity becomes greater the greater the number of classes. 
Table 9, in which the class-interval is 25 cents, has 38 classes. 
In this table the distribution of cases throughout the range 
is highly irregular, with pronounced departures from 
symmetry. The structure of each of the other tables is orderly 
and approaches more closely a condition of symmetry. Each 
presents the wage data in condensed and compact form, so 
that one consulting the tables may learn of the size and 
distribution of weekly earnings in the factory in question 
much more readily than by reference to the chaotic 
collection of figures first shown. Such organized collections of 
data are termed frequency distributions, and their purpose, 
as the term implies, is to show in a condensed form the 
nature of the distribution of a variable quantity throughout 
the range covered by the values of the variable. The 
construction of such a table is the first step to be taken in the 
organization and analysis of quantitative data of the type 
represented above. 
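The grouping process that produces such tables can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the original text: the function name is an assumption, amounts are held in cents to avoid rounding trouble, and for brevity the data are limited to the thirty smallest earnings in the array.

```python
from collections import Counter

def frequency_table(values_cents, start_cents, width_cents):
    """Count the values falling in successive classes of equal width.

    Returns (lower limit, upper limit, frequency) for each class,
    all in cents. Every value must be at least start_cents.
    """
    counts = Counter((v - start_cents) // width_cents for v in values_cents)
    table = []
    for k in range(max(counts) + 1):
        lower = start_cents + k * width_cents
        upper = lower + width_cents - 1
        table.append((lower, upper, counts.get(k, 0)))
    return table

# The thirty smallest weekly earnings from the array, in cents.
lowest_thirty = [2285, 2300, 2300, 2310, 2340, 2350, 2350, 2360, 2400, 2410,
                 2410, 2410, 2410, 2410, 2415, 2425, 2425, 2430, 2435, 2450,
                 2450, 2455, 2455, 2455, 2460, 2460, 2460, 2475, 2475, 2500]

for lower, upper, freq in frequency_table(lowest_thirty, 2200, 100):
    print(f"${lower / 100:.2f} to ${upper / 100:.2f}  {freq}")
```

With a class-interval of one dollar the first three classes reproduce the first three frequencies of Table 7; the fourth class is incomplete because only part of the array was supplied.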

STEPS IN THE CONSTRUCTION OF A FREQUENCY TABLE 

This general introduction to the subject of frequency 
tables has left untouched many important matters in con- 
nection with their construction. It remains to present a 
summary statement of these details. It will be clear that 
the first step here taken, the arrangement of the items in 
order of magnitude, is unnecessary in the actual 
construction of such a table. Having determined the upper and 
lower limits through an inspection of the data, one has 
but to decide on the number of classes desired, write the 
class-intervals on an appropriate blank sheet, and proceed 
to tally the cases falling in each of the classes thus set off. 
When this process is completed the frequencies are com- 
puted and the totals arranged in tabular form of the type 
illustrated above. These simple operations involve decisions 
on a number of points, however. 

SIZE OF CLASS-INTERVAL 

In deciding upon the size of the class-interval (which is 
equivalent to deciding upon the number of classes) one 
fundamental consideration should be borne in mind, namely, 
that classes should be so arranged that there will be no 
material departure from an even distribution of cases within 
each class. This arrangement is necessary because, in inter- 
preting the frequency table and in subsequent calculations 
based upon it, the mid-value of each class is taken to 
represent the values of all cases falling in that class. Thus, in 
basing calculations upon Table 8, it is assumed that the 
22 cases falling between $26.00 and $26.50 may all be 
represented by the mid-value of that class, $26.25. This 
assumption will seldom be strictly valid. In the case just cited 
reference to the original figures will show that it is not a 
correct assumption. Absolute accuracy would only be 
obtained by having a class for every value represented in the 
original figures. Since condensation is necessary an 
arrangement of classes should be secured which will minimize the 
error involved, without transgressing other requirements. 
Table 6 furnishes an example of class-intervals too wide 
for the material. 
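The assumption can be examined directly for the class just cited. The sketch below is illustrative Python, not part of the original text; it takes the 22 earnings between $26.00 and $26.49 from the array, expressed in cents, and compares them with the mid-value of $26.25 named above.

```python
from statistics import mean

# The 22 cases of the array falling between $26.00 and $26.49, in cents.
cases = ([2600] * 2 + [2610] * 2 + [2615] * 2 + [2625] * 3
         + [2630] * 11 + [2635, 2640])

mid_value = 2625  # the mid-value of the class as given in the text, in cents

actual_mean = mean(cases)
below = sum(c < mid_value for c in cases)
at_mid = sum(c == mid_value for c in cases)
above = sum(c > mid_value for c in cases)

print(f"cases: {len(cases)}, mean: ${actual_mean / 100:.4f}")
print(f"below mid-value: {below}, at mid-value: {at_mid}, above: {above}")
```

The cases are far from evenly spread within the class (half of them stand at $26.30), so the assumption is not literally true, although in this instance the class mean happens to lie close to the mid-value.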

The requirement which has just been described clearly 
calls for a large number of classes. A second requirement, 
which ordinarily conflicts with this, is that the number of 
classes should be so determined that an orderly and regular 
sequence of frequencies is secured. If the classification is 
too narrow for the data regularity will not be attained in 
this respect, and a table without structure or order will be 
secured. Table 9 fails to meet this requirement, as has been 
pointed out. It is desirable, also, that the number of classes 
be limited in order that the data may be easily manipulated 
and their significance readily grasped. 

A useful procedure for approximating a suitable 
class-interval has been suggested by H. A. Sturges. Given a 
series of N items of known range, a suitable class-interval, i, 
may be approximated from the formula 

\[ i = \frac{\text{Range}}{1 + 3.322 \log N} \]

The specific figure secured in a given instance is likely to 
be a fractional value, quite unsuited to actual use. An 
appropriate round number, close to the theoretical value, 
may be chosen.¹ Thus, in the example cited above, with a 
range of $9.15 and N equal to 210, the use of a class-interval 
of $1.05 is indicated by the formula. The nearest round 
number, suitable with reference to other considerations as 
well, is $1.00. Table 7, in which this class-interval is 
employed, seems to conform most thoroughly to all the 
requirements we have set forth. 

¹ This formula, and the justification for its use, are discussed in "The Choice 
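The arithmetic of the Sturges formula for the earnings data may be verified with a short sketch. The Python below is illustrative and not part of the original text; the function name is an assumption, and the logarithm is taken to the base 10, as the constant 3.322 implies.

```python
import math

def sturges_interval(data_range, n_items):
    """Approximate class-interval: Range / (1 + 3.322 log10 N)."""
    return data_range / (1 + 3.322 * math.log10(n_items))

# Range of $9.15 ($32.00 less $22.85) and N = 210, as in the text.
interval = sturges_interval(9.15, 210)
print(f"indicated class-interval: ${interval:.2f}")
```

The result agrees with the figure of $1.05 given above, for which $1.00 is the convenient round number.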

LOCATION OF CLASS LIMITS 

The location of class limits is a matter of considerable 
importance, for attention to this matter will simplify tab- 
ulation and facilitate later calculation. Tabulation of data 
is easiest when class limits are integers and the class-interval 
itself is a whole number. Calculation of averages and other 
statistical measures is facilitated when the mid-values of 
classes are integers. Suitable class limits and mid-points 
are usually secured when the data permit class-intervals of 5 
or multiples of 5 to be employed, though such an arrange- 
ment is by no means essential. 

Some types of data show a tendency to cluster or 
concentrate about certain values on the scale along which they 
are distributed. This is illustrated by the following figures 
which form part of a table showing the number of pieces 
of commercial paper discounted by the Federal Reserve 
Banks in 1921, distributed according to rates of discount 
or interest charged by member banks: 

Rate (per cent)      Number of pieces 
      6                   18,970 
      6¼                     697 
      6½                   4,616 
      6¾                     135 
      7                   17,362 
      7¼                      10 


of a Class Interval" by Herbert A. Sturges, Journal of the American Statistical 
Association, March, 1926, 65-66. The use of the formula rests on the assumption 
that the proper distribution into classes is given, for all numbers which are 
powers of 2, by a series of binomial coefficients. The relation of the terms in 
the binomial expansion to the theory of frequency distributions is discussed 
below, in Chapter XIII. 

Here is a quite obvious bunching about the integers, with 
a secondary concentration at each half of one per cent. 
No cases at all fall between the quarter values here shown. 
It is clear that in classifying such data the mid-points of 
the various classes should fall at those values about which 
the cases are concentrated, and class limits must be located 
with this end in view. For, as noted above, calculations 
based upon the frequency table are performed upon the 
assumption that all the items in each class are concentrated 
at the mid-point of that class. Thus, if a class interval of 
one half of one per cent were selected in the above example, 
the classes should extend from 5¾ to (but not including) 
6¼, 6¼ to 6¾, etc., rather than from 6 to 6½, 6½ to 7, etc. 
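The effect of the two placements of class limits can be compared on the discount figures given above. The following Python sketch is illustrative and not part of the original text; the function name is an assumption, and exact fractions are used so that the quarter-per-cent rates are represented without rounding.

```python
from fractions import Fraction as F

# Rate of discount (per cent) and number of pieces, from the figures above.
pieces = {F(6): 18970, F(25, 4): 697, F(13, 2): 4616,
          F(27, 4): 135, F(7): 17362, F(29, 4): 10}

def classify(pieces, first_lower, width):
    """Total the pieces in classes [lower, lower + width), keyed by mid-point."""
    totals = {}
    for rate, count in pieces.items():
        k = (rate - first_lower) // width      # index of the class containing rate
        mid = first_lower + k * width + width / 2
        totals[mid] = totals.get(mid, 0) + count
    return totals

centered = classify(pieces, F(23, 4), F(1, 2))   # classes 5 3/4 to 6 1/4, etc.
uncentered = classify(pieces, F(6), F(1, 2))     # classes 6 to 6 1/2, etc.

for label, table in (("centered", centered), ("uncentered", uncentered)):
    print(label, {str(mid): n for mid, n in sorted(table.items())})
```

Under the first arrangement the mid-points 6, 6½ and 7 coincide with the rates at which the pieces are concentrated; under the second the 18,970 pieces discounted at exactly 6 per cent would be treated as if they stood at 6¼.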


ACCURACY OF OBSERVATIONS AND THE DEFINITION 
OF CLASSES 

In the construction of frequency tables it is essential 
that there be a clear definition of classes, so that there may 
be no uncertainty as to their range and no question as to 
the precise class in which a given case falls. A table with 
an arrangement similar to the following is sometimes en- 
countered: 

Class-interval      Frequency 
  0 to 10               3 
 10 to 20               8 
 20 to 30              15 
 30 to 40               6 
 40 to 50               2 


In the absence of explanation, a question arises at once as 
to whether a case with a value of 10 would fall in the first 
or in the second class. It is highly desirable that the range 
of each class be indicated in some such way as the following, 
in order that this ambiguity may be avoided: 

Class-interval      Frequency 
  0 to  9.9             3 
 10 to 19.9             8 
 20 to 29.9            15 
 30 to 39.9             6 
 40 to 49.9             2 


This procedure solves the difficulty, however, only in case 
the observations are accurate to the nearest tenth. If the 
observations are accurate only to the nearest unit (that is, 
if the cases recorded as having a value of 10 actually lie 
between 9.5 and 10.5) a mere change in the description of 
the class range does not solve the problem of allocating a 
case at the class limit. In such a case an observation falling 
at a class boundary may be cut in two, one half being allo- 
cated to each of the adjacent classes. 

Yule lays down the useful principle that in fixing a class 
boundary the limit should be carried to a farther place in 
decimals, or a smaller fraction, than the values of the indi- 
vidual cases as originally recorded. Thus, in the preceding 
example, if observations were correct to the nearest tenth, 
it would mean that a value recorded as 9.9 actually lay 
between 9.85 and 9.95. In accurately describing the classes, 
therefore, the intervals should be given as 0 to 9.95, 9.95 
to 19.95, etc. (Since the observations to be tabulated are 
recorded only to the first decimal place no ambiguity arises 
from the apparent over-lapping of these class limits.) It 
should be noted that the values of the mid-points, with 
these class limits, would be 4.95, 14.95, etc. In presenting 
and using the table as given above the real meaning of the 
class limits should be borne in mind. In all cases class 
boundaries must be fixed with reference to the accuracy of 
the observations. 

The work of tabulation is simplified if, in designating a 
class, both limits are stated, as above. Errors are likely if 
only the lower limit of each class is given, or if the 
mid-point alone is designated. It is desirable, however, 
particularly if calculations are to be based upon the table, to 
include a separate column showing the values of the 
mid-points of the various classes. 

OTHER REQUIREMENTS 

Class-intervals should be uniform throughout the table 
in order that all classes may be comparable. Occasionally 
tables are published with varying class-intervals, so that 
on one section of the scale the number of items falling 
within a class having an interval of 5 is given, and on 
another section of the scale the number of items falling 
within a class having a range of 10 is given. Obviously, 
comparison of classes is impossible. It may be desirable 
to show in more detail the cases falling within certain 
ranges on the scale, but this end is best achieved by the 
construction of a supplementary table relating only to the 
cases falling within this restricted section. The utility of 
the main table is not lessened thereby. 

Similar in nature is the requirement that there should be 
no indeterminate classes, that is, classes the ranges of which 
are not defined. Had all the individuals making $30.00 and 
over in the illustration of piece-work earnings been entered 
in a class with the designation “$30.00 and over,” the 
upper limit of this class would have been quite uncertain. 
This fault in a table is a vital one when it is desired to base 
calculations upon the data contained in the table. When 
there are several extreme cases the inclusion of such classes 
is sometimes unavoidable, but when this is done the actual 
values of the cases included in such “open end” classes 
should be given in a footnote to the table. 

The errors described in the two preceding paragraphs 
are exemplified in the table on page 62. 

Table 10 

Frequency Distribution of Rented Dwellings in Reno, Nevada, 1934 
(Classified on the basis of rental value ¹) 

    Monthly rental            Number of dwellings 
                                 in each group 

    Under $10.00                       327 
    $10.00 to $14.99                   349 
     15.00 to  19.99                   521 
     20.00 to  29.99                 1,039 
     30.00 to  49.99                 1,075 
     50.00 to  74.99                   189 
     75.00 to  99.99                    24 
    $100.00 and over                     9 

In this case the ranges of the two “open end” classes are 
not known. The ranges of the intermediate classes vary, 
being $5.00 for two classes, $10.00 for one class, $20.00 
for one class and $25.00 for two classes. The purposes of a 
special investigation may sometimes be served by the use 
of such a form, but a table of this type is poorly adapted to 
the requirements of statistical calculation. 
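
The difficulty may be made concrete with a short computation on
the closed classes of Table 10. This is a sketch only: the
variable names are illustrative, and the reduction of each
frequency to "dwellings per dollar of rent" is a device
introduced here, not one employed in the text.

```python
# Closed classes of Table 10: (lower limit, upper limit, number of dwellings).
# The two open-end classes are omitted because their widths are unknown.
table_10 = [
    (10.00, 14.99, 349),
    (15.00, 19.99, 521),
    (20.00, 29.99, 1039),
    (30.00, 49.99, 1075),
    (50.00, 74.99, 189),
    (75.00, 99.99, 24),
]

# Rents are recorded to the cent, so "10.00 to 14.99" spans $5.00.
widths = [round(upper - lower + 0.01, 2) for lower, upper, _ in table_10]
counts = [count for _, _, count in table_10]
per_dollar = [count / width for count, width in zip(counts, widths)]

for (lower, upper, count), width, density in zip(table_10, widths, per_dollar):
    print(f"{lower:6.2f} to {upper:6.2f}  width {width:5.2f}  "
          f"dwellings {count:5d}  per dollar {density:6.1f}")
```

The class 30.00 to 49.99 holds the most dwellings, but it is four
times as wide as the class 15.00 to 19.99; reckoned per dollar of
rent, the narrower class is the denser of the two. The raw
frequencies of unequal classes are thus not directly comparable.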

The Structure of Statistical Tables 

The preceding discussion has been confined to certain 
more or less technical problems which arise in the construc- 
tion of a frequency table. Nothing has been said directly 
as to the form of the completed table, the arrangement of 
columns and rows, the title, the notation. No general prin- 
ciples of tabular arrangement have been laid down. While 
no detailed treatment of these principles is possible within 
the scope of the present discussion, certain general consid- 
erations relating to the structure of statistical tables may be 
suggested. 

¹ The table is taken from Real Property Inventory, 1934, Summary and 
Sixty-Four Cities Combined, Department of Commerce, Washington. Figures for 
253 rented dwellings in Reno were not reported. 

The statistical table is merely a device for presenting in 
summary fashion a mass of quantitative data. Unless the 
summary be clear, significant, concise, and readily interpreted 
nothing has been gained by the process of tabulation 
and classification. A sprawling, formless table is like a 
rambling, unintelligible discourse. There must be a purpose 
in back of each table, and this purpose should be clearly 
brought out in its arrangement. The means by which this 
purpose may be attained in a given case must be determined 
with reference to the specific conditions affecting that 
case, but standard practices should be followed, in so far 
as possible. The following general principles will be found 
helpful in deciding upon the form and arrangement of statistical 
tables: 

1. The title should constitute a clear, concise and complete 
   description of the material assembled in the table. 

2. Headings of columns and rows should be concise and unambiguous. 

3. Variable quantities should increase from left to right and 
   from top to bottom, when such arrangement is feasible. 

4. Columns and rows may be numbered to facilitate reference to 
   the table. 

5. The units of measurement employed should be clearly indicated. 

6. Sources should be given in all cases. 

7. The table should constitute a unit, self-sufficient and 
   self-explanatory. All explanations necessary for its interpretation 
   should be included as integral parts of the table, or in 
   the form of footnotes. 

Graphic Representation of Frequency Distributions 

Frequency distributions of the type illustrated above serve 
a very important statistical function in presenting a compact 
summary of data, and in preparing these data for further 
manipulation. Such distributions may be presented not 
only in tabular form, but graphically, utilizing the general 
principles of the coordinate system which were explained 
above. Many of the characteristic features of a frequency 
distribution are most clearly revealed when the graphic 
method is adopted. 

Table 6, presenting the weekly earnings of 210 employees, 
with a class-interval of two dollars, is depicted graphically 
in Fig. 23. In this figure class-intervals are plotted along 
the x-axis and the corresponding class-frequencies along the 
y-axis, appropriate scales being selected. The fact should 
be noted that the scale of abscissas starts not with zero, 
but with $20. For convenience in presentation that part 
of the scale extending from 0 to $20 is omitted. The student 
should bear this in mind in seeking to secure a correct 
impression of the relations between the two variables plotted. 
In constructing such a figure, which is termed a column 
diagram or histogram, short horizontal lines are drawn connecting 
the points plotted to represent the upper and lower 
limits of each class-interval. In interpreting this diagram 
it should be noted that the areas of the different rectangles 
are proportional to the number of cases represented, the 
total area representing the entire 210 cases. This device 
thus presents to the eye a very clear picture of the distribution, 
showing quite unmistakably the relative number of 
workers falling in each of the wage classes. 

Fig. 23. — Column Diagram: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $2.00) 

The classes in this case are so large, however, that some 
violence is done to the facts. So many details are lost 
that a true conception of the disposition of the items is not 
given. Fig. 24 is a histogram depicting the distribution of 
cases when a class-interval of one dollar is used. In this 
case, with smaller steps, we approach more closely an orderly 
and symmetrical distribution. The same is true of Fig. 25 
which shows the distribution when the class-interval is 
fifty cents. The distribution represented in Fig. 26 has a 
class-interval of twenty-five cents which, as has been pointed 
out, is too narrow for the data, with the result that a quite 
irregular structure is secured. (It should be noted that 
the vertical scale is not the same in these four figures, so 
that comparison with respect to class-frequencies is only 
possible by reference to the scale figures.) 

Fig. 24. — Column Diagram: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $1.00) 



Fig. 25. — Column Diagram: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $.50) 

Fig. 26. — Column Diagram: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $.25) 

Fig. 27. — Frequency Polygon: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $2.00) 

Fig. 28. — Frequency Polygon: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $1.00) 




Frequency polygons corresponding to the histograms of 
Figs. 23, 24, and 25 are shown in Figs. 27, 28, and 29. 
Each of these polygons has been constructed by plotting as 
abscissas the mid-points of the class-intervals, and as ordinates 
the class-frequencies, the points thus secured being 
connected by a broken line. In completing such a figure 
the class next below the lowest one on the scale and the 
class next above the highest one on the scale are included, 
the class-frequency being zero in each case. The ends of 
the polygon thus connect with the base line at the mid-points 
of these two extra classes. In the case of the frequency 
polygon the entire area under the curve represents 
the entire number of cases, but the area of a given interval 
cannot be taken to be proportional to the number of cases 
in that interval, because of irregularities in the distribution 
on either side of the given class. The heights of the ordinates 
at the mid-points of the various classes are, of course, 
scaled to represent the class-frequencies. 

Fig. 29. — Frequency Polygon: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $.50) 
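
The construction just described may be sketched as follows. The
function `polygon_points` and the four class-frequencies are
hypothetical, chosen for illustration; they are not the figures
of Table 6. The sketch assumes classes of equal width.

```python
def polygon_points(lower_limits, width, frequencies):
    """Return (mid-point, frequency) pairs for a frequency polygon.

    One empty class is added below the lowest class and one above the
    highest, so that the polygon closes on the base line.
    """
    mids = [lower + width / 2 for lower in lower_limits]
    xs = [mids[0] - width] + mids + [mids[-1] + width]
    ys = [0] + list(frequencies) + [0]
    return list(zip(xs, ys))


# Hypothetical classes of width $2.00 beginning at $20, $22, $24 and $26.
frequencies = [5, 12, 9, 3]
points = polygon_points([20, 22, 24, 26], 2, frequencies)

# Area under the broken line, taken trapezoid by trapezoid.
area = sum((x1 - x0) * (y0 + y1) / 2
           for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(points)
print(area, 2 * sum(frequencies))
```

With equal class-intervals the total area under the polygon
equals the class-interval multiplied by the total number of
cases, which is also the total area of the corresponding
histogram. The area standing over any single interval, however,
depends upon the neighboring frequencies as well as its own.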

The Smoothing of Curves 

Attention is again called to the results secured with 
varying class-intervals. As the class-interval is decreased, 
up to a certain point, the histograms and polygons become 
smoother and more regular. Beyond that point breaks 
begin to appear in the data; the regular change in class-frequencies 
which was found when the classes were larger 
is broken by the appearance of irregular classes which 
seem to depart from the general rule. In Fig. 25 these 
have become quite pronounced. Such irregularities, it is 
obvious, are exceptions to a general rule which seems to 
prevail, the general rule that the numbers of workers falling 
within the different wage classes increase from the lower 
limit of earnings up to a maximum in the neighborhood of 
$27.50, and then decrease till, at the upper limit of $32, 
but one worker is found. Since all the 210 individuals are 
engaged in the same work, and since their earnings depend 
only upon their rapidity and skill, one would expect a quite 
regular increase and decrease. If we had figures not for 
one week only, but for 52 weeks, and took the average 
weekly earnings of each of the 210 workers for the year, 
we should expect greater regularity with the smaller class-intervals 
than is actually found, since the accidental fluctuations 
peculiar to one week alone would thus be eliminated. 
Or, if we had earnings during one week for 10,920 
workers (52 times 210), the same result would be secured. It is 
essential not only to decrease the size of the classes but 
also to increase the number of cases, in order that the 
accidental irregularities which affect a small number of 
observations may be eliminated. A refined classification 
with a small number of cases leads to the condition exemplified 
in Fig. 26. But such an increase in the number of cases 
is, in general, a practical impossibility. We wish, if possible, 
to develop a feasible method of approximating the 
distribution which would be secured with very small class-intervals 
and a very large number of cases. Such an approximation 
is possible through the device of curve-smoothing. 
By this method we may secure a smooth frequency curve 
which lacks the irregularities occasioned by minor fluctuations. 

Such a smooth frequency curve serves to represent the 
true underlying distribution of the data. It was pointed 
out that areas in the frequency polygon are not propor- 
tional to the number of cases included, the cause lying in 
the irregularities of the data. In a smoothed frequency 
curve these irregularities have been eliminated, and the 
area between ordinates erected at given points on the scale 
of abscissas is assumed to be proportional to the theoretical 
frequency of cases between the given values. Moreover, a 
smooth trend having been established, frequencies for intermediate 
values not shown in the original table may be 
determined by interpolation.¹ 

The following data,² representing the distribution in 1918 
of personal incomes below $4,000, will serve to exemplify 
the smoothing process. 

¹ The limitations of practical statistical work are such that there must of 
necessity be many gaps in the data. The given values of the variables are not 
continuous. Interpolation is the process of estimating values of a variable 
quantity between given values, or of locating a point on a curve between given 
points. That interpolation is most accurate which leads to estimated values 
having the highest degree of consistency with the given values. 

² From Vol. I, Income in the United States, National Bureau of 
Table 11 

Distribution of Income among Personal Income Recipients in 1918 
(Including all personal incomes below $4,000) 

    Income class ¹          Number of persons ² 

       $0 to   $100               62,809 
      100 to    200              103,704 
      200 to    300              209,087 
      300 to    400              489,963 
      400 to    500              961,991 
      500 to    600            1,549,974 
      600 to    700            2,154,474 
      700 to    800            2,668,466 
      800 to    900            3,013,034 
      900 to  1,000            3,144,722 
    1,000 to  1,100            3,074,351 
    1,100 to  1,200            2,850,526 
    1,200 to  1,300            2,535,285 
    1,300 to  1,400            2,205,728 
    1,400 to  1,500            1,832,230 
    1,500 to  1,600            1,512,649 
    1,600 to  1,700            1,234,397 
    1,700 to  1,800              999,996 
    1,800 to  1,900              811,236 
    1,900 to  2,000              663,789 
    2,000 to  2,100              549,787 
    2,100 to  2,200              463,222 
    2,200 to  2,300              395,115 
    2,300 to  2,400              340,141 
    2,400 to  2,500              295,490 
    2,500 to  2,600              258,650 
    2,600 to  2,700              227,731 
    2,700 to  2,800              201,488 
    2,800 to  2,900              178,901 
    2,900 to  3,000              154,499 
    3,000 to  3,100              142,802 
    3,100 to  3,200              128,217 
    3,200 to  3,300              115,583 
    3,300 to  3,400              104,504 
    3,400 to  3,500               94,803 
    3,500 to  3,600               86,405 
    3,600 to  3,700               79,023 
    3,700 to  3,800               72,562 
    3,800 to  3,900               66,900 
    3,900 to  4,000               61,894 

¹ The definition of classes used is equivalent to “$0 to and not including 
$100,” etc. Thus an individual with an income of $100 would fall in the second 
class. 

² The National Bureau’s report states “The numbers below are given to the 
nearest unit. It is not pretended that such arithmetic accuracy is anything 


Figures 30, 31, and 32 present column diagrams of these 
income data, grouped with class-intervals of $500, $200, and 
$100. As the class-interval is decreased the histograms become 
more regular and uniform, but our original data 
permit us to carry this process only to the point where the 
class-interval is $100. Our problem is to determine the 
underlying distribution which the data approximate more 
and more closely as the class-interval is lessened. If we 
replace the broken line of the histogram by a smooth curve 
enclosing the same total area as the histogram and so drawn 
through the points of the histogram that the area cut from 
each rectangle is approximately equal to the area added to 
the same rectangle by the curve, we will have a frequency 
curve representing the desired distribution. The requirement 
that the same total area be enclosed is fundamental. 
Exceptions to the rule concerning the area of individual 
rectangles will frequently occur because of the existence of 
quite irregular classes, but as a general working principle 
it is helpful. (More refined methods of fitting a smooth 
curve to data will be discussed at a later point, but a process 
of smoothing by inspection such as that described above 
gives a fairly close approximation to the required curve.) 

Fig. 30. — Column Diagram: Distribution of Personal Income Recipients 
in the United States, 1918. Including all Recipients of Incomes below 
$4,000 (Class-interval = $500) 

Fig. 31. — Column Diagram: Distribution of Personal Income Recipients 
in the United States, 1918. Including all Recipients of Incomes below 
$4,000 (Class-interval = $200) 

Fig. 32. — Column Diagram: Distribution of Personal Income Recipients 
in the United States, 1918. Including all Recipients of Incomes below 
$4,000 (Class-interval = $100) 

Fig. 33. — Frequency Curve: Distribution of Personal Income Recipients 
in the United States, 1918. Including all Recipients of Incomes below 
$4,000 (Derived from the column diagram with class-interval of $100) 
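
The regrouping that lies behind the column diagrams with
class-intervals of $500, $200, and $100 may be sketched as
follows. The frequencies are those of Table 11; the helper
`regroup` is a name assumed here for illustration. The sketch
covers only the combining of classes, not the smoothing by
inspection, which remains a matter of judgment.

```python
# Numbers of persons in the forty $100 classes of Table 11, $0 to $4,000.
persons = [
    62809, 103704, 209087, 489963, 961991,
    1549974, 2154474, 2668466, 3013034, 3144722,
    3074351, 2850526, 2535285, 2205728, 1832230,
    1512649, 1234397, 999996, 811236, 663789,
    549787, 463222, 395115, 340141, 295490,
    258650, 227731, 201488, 178901, 154499,
    142802, 128217, 115583, 104504, 94803,
    86405, 79023, 72562, 66900, 61894,
]


def regroup(frequencies, k):
    """Combine every k adjacent classes into one wider class."""
    return [sum(frequencies[i:i + k]) for i in range(0, len(frequencies), k)]


by_200 = regroup(persons, 2)   # twenty classes of $200
by_500 = regroup(persons, 5)   # eight classes of $500

for lower, count in zip(range(0, 4000, 500), by_500):
    print(f"${lower:5,d} to ${lower + 500:5,d}  {count:11,d}")
```

Whatever the class-interval, the total number of persons is
unchanged. This is the tabular counterpart of the requirement
that the smooth curve enclose the same total area as the
histogram.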

Figure 33 illustrates the result of smoothing the histogram 
of income distribution shown in Fig. 32. Here the 
quite artificial jumps between income classes are smoothed 
out, and we secure the graduation by infinitesimal increments 
which we should expect to find when the incomes of 
so many millions of persons are included. Here we have 
that which we desired — an approximation to the true underlying 
distribution, with the sharp breaks resulting from 
the method of classification eliminated. 

CONTINUOUS AND DISCRETE SERIES 

The logical validity of the smoothing process is dependent 
upon the nature of the data being manipulated. From 
this point of view frequency series of the type discussed 
above may be divided into two classes, continuous series 
and non-continuous or discrete series. A continuous series is 
one in which the values of the independent variable increase 
or decrease by increments which are infinitely small. 
A discrete series is one in which the phenomena represented 
by the independent variable always change in value by 
definite amounts. The curve of underlying values rises not 
smoothly, as for the continuous series, but by jumps. 

The fact should be emphasized that in making this distinction 
we are speaking of the values as they would be found 
in the underlying universe of phenomena from which the ac- 
tual bodies of material we study are drawn. Any given 
sample, whether representing continuous or discrete series, 
will be marked by breaks in the values of the independent 
variable. This will be true, in the case of a continuous 
series, because of the limitations upon the instruments and 
senses we use in measuring. Thus if the heights of 100 men 
be measured, the independent variable of the frequency 
series (height) will increase by finite amounts. We may 
measure to the nearest inch, or perhaps to the nearest 
eighth of an inch. Yet if ten thousand or ten million men 
were arranged in order of height the differences between 
successive individuals would be infinitely small. Height is a 
continuous variable, even though the values found in a given 
sample are marked by discontinuity. 

Quite different is the distribution of such a variable as 
interest or discount rates. If one were to secure 100 such 
quotations and rank them in the order of size the variations 
would be discontinuous, as in the sample of men 
whose heights were measured. But in the case of heights 
the underlying values, if they could be determined for a 
large population, would be marked by continuous varia- 
tion, whereas, were an infinite number of discount rate 
quotations secured, there would still be breaks in the se- 
quence. Discount rates increase or decrease by one quarter 
or one half of one per cent, not by infinitesimal amounts. 
Such a series is termed discrete, or non-continuous. 

The smoothing process provides a means of securing an 
approximation to the distribution of values as they would 
be found if a sample could be increased indefinitely in size. 
It is based upon the assumption that the irregularities 
found in the sample actually studied are accidental, and 
that the underlying values would show continuous and un- 
broken variation. Obviously, therefore, it is only fully 
justified when applied to a continuous series. A histogram 
of human heights may be smoothed in order to secure a 
representation of the true underlying distribution in the pop- 
ulation at large, and interpolation based upon this smoothing 
process is valid. But smoothing is quite illogical for a 
markedly discontinuous series. It would be meaningless to 
construct a smooth curve showing the distribution of dis- 
count rates for the purpose of securing the theoretical fre- 
quency of a rate of 4.3675 per cent. In practical statistical 
work, however, it is frequently helpful to handle discrete 
series as though they were continuous, and in these cases 
the smoothing device may be employed. But in the inter- 
pretation and use of the smoothed curve the important 
logical distinction between continuous and discontinuous 
variation should be kept clearly in mind. 

Cumulative Arrangement of Statistical Data 

For certain purposes it is desirable to arrange data cumulatively, 
rather than in separate and exclusive classes of 
the type illustrated in the frequency tables presented above. 
The following material will illustrate some of the advantages 
of this arrangement. 

In a study of the durability of telephone poles ¹ these 
results were secured: 


Table 12 

Frequency Distribution of 248,707 Telephone Poles, Classified 
According to Length of Life 

    Length of life        Number of poles 
       (years)              (frequency) 

       0- 0.9                  1,150 
       1- 1.9                  4,221 
       2- 2.9                 10,692 
       3- 3.9                 13,966 
       4- 4.9                 16,633 
       5- 5.9                 18,211 
       6- 6.9                 19,011 
       7- 7.9                 19,260 
       8- 8.9                 20,909 
       9- 9.9                 19,879 
      10-10.9                 20,764 
      11-11.9                 15,454 
      12-12.9                 14,237 
      13-13.9                 13,779 
      14-14.9                  9,764 
      15-15.9                  8,534 
      16-16.9                  7,659 
      17-17.9                  6,918 
      18-18.9                  4,591 
      19-19.9                  1,798 
      20-20.9                    815 
      21-21.9                    313 
      22-22.9                    102 
      23-23.9                     47 


¹ “Replacement Insurance,” Edwin Kurtz, Administration, July, 
41-69. 

The table shows that 1,150 poles were scrapped during 
the first year of use, that 4,221 were scrapped after reaching 
the age of one year and before reaching the age of two 
years, and so on. This is simply a frequency table of the 
ordinary type. A much more significant arrangement for 
many purposes is secured when the figures are assembled 
cumulatively, as in the following table. 

Table 13 

Cumulative Distribution of 248,707 Telephone Poles, Classified 
According to Length of Life 
(Cumulated upward) 

    Length of life          Number of poles surviving 
                                  (frequency) 

    Less than  1 year                1,150 
    Less than  2 years               5,371 
    Less than  3 years              16,063 
    Less than  4 years              30,029 
    Less than  5 years              46,662 
    Less than  6 years              64,873 
    Less than  7 years              83,884 
    Less than  8 years             103,144 
    Less than  9 years             124,053 
    Less than 10 years             143,932 
    Less than 11 years             164,696 
    Less than 12 years             180,150 
    Less than 13 years             194,387 
    Less than 14 years             208,166 
    Less than 15 years             217,930 
    Less than 16 years             226,464 
    Less than 17 years             234,123 
    Less than 18 years             241,041 
    Less than 19 years             245,632 
    Less than 20 years             247,430 
    Less than 21 years             248,245 
    Less than 22 years             248,558 
    Less than 23 years             248,660 
    Less than 24 years             248,707 
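
The upward cumulation is a running total of the class
frequencies of Table 12, and may be sketched as follows. The
variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies of Table 12: poles scrapped in each year of age,
# from "0-0.9" through "23-23.9".
poles = [
    1150, 4221, 10692, 13966, 16633, 18211, 19011, 19260,
    20909, 19879, 20764, 15454, 14237, 13779, 9764, 8534,
    7659, 6918, 4591, 1798, 815, 313, 102, 47,
]

# Entry k is the number of poles with a life of less than k + 1 years.
less_than = list(accumulate(poles))

for years, count in enumerate(less_than, start=1):
    print(f"Less than {years:2d} years  {count:7,d}")
```

The last entry of the running total necessarily equals the
total number of poles, which affords a check on the addition.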


It is important to note that it is possible to cumulate a 
frequency series in two different ways. From the above 
table we may determine readily the number failing to attain 
any given age. It is often more convenient to reverse 
the process, so that the table will enable the total number 
above any given value to be immediately determined. When 
the telephone pole figures are thus cumulated downward 
the following table is secured. 


Table 14 

Cumulative Distribution of 248,707 Telephone Poles, Classified 
According to Length of Life 
(Cumulated downward) 

          (1)                     (2)                  (3) 
    Length of life     Number of poles surviving    Per cent 
                              (frequency) 

     0 and more                 248,707              100.0 
     1 year and more            247,557               99.5 
     2 years and more           243,336               97.8 
     3 years and more           232,644               93.6 
     4 years and more           218,678               88.0 
     5 years and more           202,045               81.2 
     6 years and more           183,834               73.8 
     7 years and more           164,823               66.3 
     8 years and more           145,563               58.5 
     9 years and more           124,654               50.1 
    10 years and more           104,775               42.1 
    11 years and more            84,011               33.8 
    12 years and more            68,557               27.6 
    13 years and more            54,320               21.8 
    14 years and more            40,541               16.3 
    15 years and more            30,777               12.4 
    16 years and more            22,243                8.9 
    17 years and more            14,584                5.9 
    18 years and more             7,666                3.1 
    19 years and more             3,075                1.2 
    20 years and more             1,277                0.5 
    21 years and more               462                0.2 
    22 years and more               149                0.06 
    23 years and more                47                0.02 
    24 years and more                 0                0.00 


Cumulative tables such as those given above have distinct 
advantages in the handling of many types of data. Life 
tables are generally presented in this form. The scientific 
study of depreciation will lead to the construction of elaborate 
“mortality tables” for various types of equipment, 
and these will be most useful in the cumulative form. It 
is frequently desirable to reduce the frequencies to percentages, 
as in column (3) of Table 14, though it should 
not be forgotten that the significance of the percentages 
depends upon the absolute numbers upon which they are 
based. 
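
The downward cumulation and the reduction to percentages may
be sketched in the same manner. The frequencies are repeated
from Table 12 so that the fragment is complete in itself; the
variable names are illustrative.

```python
# Class frequencies of Table 12, "0-0.9" through "23-23.9" years.
poles = [
    1150, 4221, 10692, 13966, 16633, 18211, 19011, 19260,
    20909, 19879, 20764, 15454, 14237, 13779, 9764, 8534,
    7659, 6918, 4591, 1798, 815, 313, 102, 47,
]

total = sum(poles)

# surviving[k] is the number of poles with a life of k years or more.
surviving = [total]
for scrapped in poles:
    surviving.append(surviving[-1] - scrapped)

# The same figures as percentages of the total number of poles.
percent = [100 * count / total for count in surviving]

for years, (count, share) in enumerate(zip(surviving, percent)):
    print(f"{years:2d} years and more  {count:7,d}  {share:6.2f}")
```

The percentages are printed beside the absolute numbers, in
keeping with the caution that the former take their
significance from the latter.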

THE OGIVE, OR CUMULATIVE FREQUENCY CURVE 

The general utility of such cumulated data is limited by 
the classification system necessarily adopted in condensing 
the material. Unless we interpolate mathematically we are 
limited to the points on the scale actually noted in the two 
tables. For this reason, a generalized cumulative curve 
similar to the smoothed frequency curve described in the 
preceding section is desirable. If the values given in Table 13 
be plotted on coordinate paper (the length of life in each 
case as abscissa, and the corresponding number of poles as 
ordinate) and a smooth curve drawn through the points 
thus plotted, the cumulative frequency curve shown in 
Fig. 34 is secured. In Fig. 35 the data of Table 14 are 
plotted. 

Fig. 34. — Cumulative Frequency Curve: Distribution of Telephone Poles 
Classified according to Length of Life (Cumulated upward) 

Such a curve constitutes one of the most effective and 
useful representations of a frequency series. It is obvious 
that the limitations of the particular class-interval adopted 
are in large part removed; the shape of the curve will be 
fundamentally the same, though the class-interval and number 
of classes may vary. Frequency curves of the usual 
type may not be compared unless the groupings are the 
same, but cumulative frequency curves are subject to no 
such restriction. Moreover, uneven class-intervals do not 
distort the ogive, or cumulative curve, as they do the ordinary 
frequency curve. 

Fig. 35. — Cumulative Frequency Curve: Distribution of Telephone Poles 
Classified according to Length of Life (Cumulated downward) 

The cumulative curve is particularly well adapted to 
interpolation. Thus if it is desired to know the number of 
poles surviving less than 15½ years, the value of the ordinate 
of the curve having 15½ as abscissa may be approximated 
from Fig. 34. A value of 222,000 is secured. If the 
number surviving 8½ years or more is desired, a similar 
estimate may be made from Fig. 35. The interpolated 
figure in this case is 135,000. 

Another type of interpolation possible with such a curve 
is the determination of the number of cases falling within 
any given interval. One is not limited to the class-intervals 
marked out in the original tables. For instance, it may be 
desirable to know the number of poles surviving more than 
10½ but less than 15 years. Reading from the table or from 
the chart we find that 217,930 poles survived less than 
15 years. Interpolating on the chart in the manner described 
above a figure of 154,000 is secured for the number 
surviving less than 10½ years. Subtracting the latter figure 
from the former we have 63,930 as the number of poles 
falling within the 10½ to 15 years interval. The figure is, 
of course, an approximation to the true value, as are all 
values secured through such smoothing and interpolation. 
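
Readings of this kind may be checked arithmetically. The sketch
below substitutes straight-line interpolation between the
tabulated points for a reading taken from the smooth curve; it
is therefore an approximation to the graphic method, not a
reproduction of it. The function name `poles_below` is
assumed for illustration.

```python
from bisect import bisect_right
from itertools import accumulate

# Class frequencies of Table 12, "0-0.9" through "23-23.9" years.
poles = [
    1150, 4221, 10692, 13966, 16633, 18211, 19011, 19260,
    20909, 19879, 20764, 15454, 14237, 13779, 9764, 8534,
    7659, 6918, 4591, 1798, 815, 313, 102, 47,
]

ages = list(range(25))                      # 0, 1, ..., 24 years
less_than = [0] + list(accumulate(poles))   # poles with a life below ages[k]
total = less_than[-1]


def poles_below(age):
    """Interpolate linearly the number of poles with a life below `age`."""
    if age <= 0:
        return 0.0
    if age >= ages[-1]:
        return float(total)
    k = bisect_right(ages, age) - 1          # tabulated point at or below age
    fraction = age - ages[k]
    return less_than[k] + fraction * (less_than[k + 1] - less_than[k])


below_15_5 = poles_below(15.5)               # life of less than 15½ years
at_least_8_5 = total - poles_below(8.5)      # life of 8½ years or more
between = poles_below(15) - poles_below(10.5)
print(below_15_5, at_least_8_5, between)
```

Straight-line interpolation gives about 222,200, 135,100, and
63,600 for the three quantities, against chart readings of
222,000, 135,000, and 63,930. The differences are of the order
to be expected when a curve is read by eye.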

It should be noted that the ogive may be derived directly 
from the array, without the formation of a frequency table 
as an intermediate step. This curve, in fact, may be looked 
upon as merely a graphic representation of the array. It 
represents one of the simplest forms of statistical organi- 
zation, as well as one of the most effective methods of 
manipulating quantitative data. 

RELATION BETWEEN THE OGIVE AND THE FREQUENCY CURVE 

The ogive and the frequency curve are merely two dif- 
ferent arrangements of precisely the same material, each 
arrangement having certain distinctive advantages. The 
characteristics of each may be more clearly apparent if the 
structural relationship between these two curves is understood. 
This relationship is graphically portrayed in Fig. 36.¹ 

¹ The suggestive arrangement shown in this figure was originated by Robert 
E. Chaddock. 
Fig. 36. — Distribution of Sawmills in the United States Classified according 
to Labor Cost in 1921. Illustrating the Structural Relation 
between the Ogive and the Frequency Curve. 

This figure is based upon the following frequency table, 
showing the distribution of sawmills in the United States, 
classified on the basis of labor cost per 1,000 feet of lumber 
produced.¹ 

Table 15 

Frequency Distribution of 269 Sawmills in the United States Classified 
According to Labor Cost in 1921 

    Labor cost (all employees) per      Number of establishments 
    1,000 feet, board measure                 (frequency) 

          $1.00-$1.49                               3 
           1.50- 1.99                              10 
           2.00- 2.49                              14 
           2.50- 2.99                              22 
           3.00- 3.49                              38 
           3.50- 3.99                              40 
           4.00- 4.49                              38 
           4.50- 4.99                              33 
           5.00- 5.49                              20 
           5.50- 5.99                              11 
           6.00- 6.49                              10 
           6.50- 6.99                              11 
           7.00- 7.49                               8 
           7.50- 7.99                               4 
           8.00- 8.49                               4 
           8.50- 8.99                               3 

             Total                                269 


¹ From “Labor Efficiency and Productiveness in Sawmills,” Ethelbert 
Stewart, Monthly Labor Review, January, 1923, 14. Seven scattered cases 
above $9.00 in value have been omitted from the table and the accompanying 
graph. 

The upper part of Fig. 36 indicates the method by which 
the ogive is built up. Just as in the histogram, the area of 
each rectangle is proportional to the number of cases falling 
in the given class. Since the operation is a cumulative 
one, however, the base of each rectangle is the cumulated 
frequencies of all preceding classes. Thus the y-value (frequency) 
of the first rectangle is 3, erected from zero as a 
base, the y-value of the second class is 10, erected from 3 
as a base, and so on. The slope of the curve connecting 
these rectangles is gradual at first when the frequencies 
are low, then steeper as the frequencies become greater, 
and finally tapers off as the frequencies decrease near the 
upper limit of the distribution. This is the cumulative 
frequency curve, or ogive. 

When the various rectangles representing the class-frequencies 
are dropped to the zero line as a common base, 
the x-values remaining the same throughout, the histogram 
or column diagram described in an earlier section is secured. 
From this the frequency polygon or smoothed frequency 
curve may be derived. 

CHAPTER IV


DESCRIPTION OF THE FREQUENCY 
DISTRIBUTION: AVERAGES 

The classification of quantitative data and the construc- 
tion of a frequency distribution constitute an important 
stage in the task of organization and analysis. By means 
of classification the underlying structure of the data may
be revealed and the essential unity of a mass of material 
may be brought out. But this is only the first step in statis- 
tical analysis. It remains to develop methods of measuring 
and expressing more concisely the significant characteristics 
of a body of data. For certain purposes the frequency dis- 
tribution itself must be summarized and condensed, must be 
boiled down until its essence has been distilled into three or 
four significant figures. 

If each frequency distribution constituted a novel and 
unique problem, obeying a law peculiar to itself, the task
of studying and describing such distributions would be a 
difficult one. Fortunately this is not so. Quantitative 
data in widely different fields, when assembled in frequency 
distributions, show certain common characteristics, obey 
certain general laws. Experience in one field, therefore, 
constitutes a guide to work in others. Uniformity in the 
behavior of masses of data makes possible the development of 
a generalized method of organizing, analyzing and comparing 
measurements drawn from many fields of scientific study. 

Comparison of Frequency Distributions 

This fact of a common law of arrangement running through 
the universe of quantitative facts may be brought home 
most effectively by a comparison of distributions illustrative 
of various types of data. The characteristics of the frequency
distributions and of the frequency curves which follow should 
be noted, and the distributions compared. 



Fig. 37. — Frequency Curve: Distribution of 18,780 Soldiers Classified 
according to Height 

The curve in Fig. 37 is based upon the following data 
relating to the heights of 18,780 soldiers. ^ 

Table 16

Distribution of Soldiers Classified According to Height

Height in inches      Number of soldiers

      60 +                   197
      61 +                   317
      62 +                   692
      63 +                 1,289
      64 +                 1,961
      65 +                 2,613
      66 +                 2,974
      67 +                 3,017
      68 +                 2,287
      69 +                 1,599
      70 +                   878
      71 +                   520
      72 +                   262
      73 +                   174

      Total               18,780

^ From G. Whipple, Vital Statistics, New York, 377.
Fig. 38 depicts a frequency curve based upon 1,000 observations,
made at Greenwich, of the Right Ascension of
Polaris.^ The values on the abscissa define deviations, in
seconds of time, from an origin near the mean of all the
observations. Frequencies of occurrence of given values on
the x-scale are measured, of course, as ordinates on the
y-scale. The distribution plotted in Fig. 38 is given in
Table 17.

Fig. 38. Frequency Curve: Distribution of Errors of Observation in
Astronomical Measurements (abscissa: magnitude of deviation in
seconds of time)

^ E. T. Whittaker and G. Robinson, The Calculus of Observations,
London, Blackie and Son, 1924, 174.

Table 17

Distribution of Errors of Observation in Astronomical Measurements
(1,000 observations of the Right Ascension of Polaris)

Magnitude of deviation,            Number of
in seconds of time, from origin    observations

        - 3.5                            2
        - 3.0                           12
        - 2.5                           25
        - 2.0                           43
        - 1.5                           74
        - 1.0                          126
        - 0.5                          150
          0                            168
          0.5                          148
          1.0                          129
          1.5                           78
          2.0                           33
          2.5                           10
          3.0                            2

        Total                        1,000

If a piece of artillery be accurately adjusted on a given
target (a point) and 100 shots be fired, it will be found
that the points of impact of the hundred shots will be dispersed
about the target. No matter how accurate the piece
or the adjustment, only a small percentage of the shots
will fall upon the exact point at which they were directed.
The points of impact will be scattered about the target
in a quite regular fashion, however. If a rectangle be so
drawn as to include all the points of impact, and this rectangle
(or zone of dispersion) be divided into eight equal
parts, the distribution of shots within these sections will be
as indicated in Fig. 39. (In any given case there are likely
to be slight departures from this order, but in the long run
this distribution will prevail.)

    |  2  |  7  | 16  | 25  | 25  | 16  |  7  |  2  |

Fig. 39. Zone of Dispersion, Artillery Firing, Showing the Theoretical
Percentage Distribution of Shots

This general rule holds for all classes of guns. The more 
accurate the gun the smaller will be the zone of dispersion, 
but the distribution within this zone is theoretically the 
same in all cases. Rules of fire used in artillery adjustment 
are based upon this fact. 













The results of actual firing may be contrasted with this
theoretical distribution. Table 18 presents a record of one
thousand shots fired from a battery gun at the middle of a
stationary target two hundred yards distant.^ The target
was divided by horizontal lines into eleven equal divisions.


Table 18

Distribution of One Thousand Shots from a Single Gun

Division          Number of shots recorded

 1 (top)                    1
 2                          4
 3                         10
 4                         89
 5                        190
 6                        212
 7                        204
 8                        193
 9                         79
10                         16
11 (bottom)                 2

Total                   1,000


These results are presented graphically in Fig. 40. 

The zone of dispersion being divided into eleven divisions 
instead of the eight referred to in describing the theoretical 
distribution, a direct comparison cannot be made. We 
have here, however, the same general type of distribution 
found in the other examples given. A tendency toward
concentration in the lower half of the target reflects a slight 
departure from symmetry. 

When coins are tossed the distribution of heads and tails
is assumed to be determined by pure chance. In a single
experiment ten coins were tossed 100 times. The following
table shows the frequencies with which given numbers of
heads appeared. (The greatest number of heads possible
in a given throw under such conditions is, of course, 10;
it is also possible that no heads should appear.)

^ This experiment is recorded in the Report of the Chief of Ordnance, 1878,
Appendix S. The results are given in The Method of Least Squares, Mansfield
Merriman, New York, Wiley, 1897, 14.



Fig. 40. Column Diagram: Distribution of 1,000 Shots from a Single
Gun (abscissa: divisions)

Table 19

Distribution of Results in Coin Tossing Experiment
(Ten coins tossed 100 times)

Number of heads      Frequency of occurrence

      10                       0
       9                       1
       8                       4
       7                       7
       6                      23
       5                      30
       4                      20
       3                       9
       2                       5
       1                       1
       0                       0
Figure 41 depicts the above frequency distribution.

the incomes above $4,000 were included.) Several additional
examples of economic data may be given. 

Figure 42 illustrates the order in which price variations are
distributed. It is based upon a study made by W. C. Mitchell
of 5,578 individual cases of change in the wholesale prices
of commodities from one year to the next.^ Thus, for example,
the average price of middling upland cotton in
New York in a given year was $0.115 per pound. In the
following year the average price was $0.128 per pound, an
increase of 11.3 per cent. This would constitute one entry
in the table of rising prices, falling in the class “10-11.9%.”
The entire table consists of 5,578 such entries. These data
are presented in Fig. 42 in the form of a frequency polygon,
no attempt being made to smooth the curve.

Fig. 42. Frequency Polygon: Distribution of 5,540 Cases of Change
in the Wholesale Prices of Commodities from One Year to the Next
(after Mitchell) (abscissa: percentage of fall, percentage of rise)

^ From a Bulletin of the U. S. Bureau of Labor Statistics, Part I, “The Making
and Using of Index Numbers,” 18. The figure shows the price changes only
within the range of a 51 per cent fall and a 51 per cent rise. One case of a
price fall of 55 per cent is not shown, and 37 cases of price increases ranging
from 52 per cent to 104 per cent have not been included.
Table 20 shows the distribution of London-New York
exchange rates (sterling exchange) from 1882 to 1913
inclusive. This was a period when both currencies were freely
convertible into gold at fixed ratios, with customary market
forces operating to keep exchange rates between the two
gold points. Observations covering recent decades would
show quite different characteristics. In the distribution
shown graphically in Fig. 43 monthly rates have been
Table 20

Distribution of London-New York Exchange Rates as Recorded by
Months during the Period 1882-1913

Class-interval

$4.8275-$4.8324
 4.8325- 4.8374
 4.8375- 4.8424
 4.8425- 4.8474
 4.8475- 4.8524
 4.8525- 4.8574
 4.8575- 4.8624
 4.8625- 4.8674
 4.8675- 4.8724
 4.8725- 4.8774
 4.8775- 4.8824
 4.8825- 4.8874
 4.8875- 4.8924
 4.8925- 4.8974
 4.8975- 4.9024
 4.9025- 4.9074
 4.9075- 4.9124

Frequency
(number of months given rate prevailed)

  1
  6
 11
 21
 23
 24
 25
 40
 45
 49
 35
 45
 33
 16
  8
  1

Total 384

Fig. 44. Distribution of Wage-Earners in Open-Hearth Furnaces,
Classified according to Average Weekly Earnings in 1935
(abscissa: class-interval in dollars per week)
States, in 1935. There is a clear concentration of workers
whose earnings fall between $16 and $24 a week. The distribution
is markedly skewed, however, with a tail extending
far to the right. The range of weekly earnings, like that of
incomes in general, is far greater above the mode than below it.

Table 21

Distribution of Wage-Earners in Open-Hearth Furnaces,
Classified According to Average Weekly Earnings in 1935
(Total for all districts)

Class-interval
(in dollars per week)

$  0-$  7.99
   8-  15.99
  16-  23.99
  24-  31.99
  32-  39.99
  40-  47.99
  48-  55.99
  56-  63.99
  64-  71.99
  72-  79.99
  80-  87.99
  88-  95.99
  96- 103.99
 104- 111.99
 112- 119.99
 120- 127.99
 128- 135.99

Frequency
(number of workers earning stated amount)

   583
 2,200
 4,462
 3,032
 1,527
   764
   358
   210
   144
    44
    36
    21
    26
     3
     7
     1

Total 13,427
between the measurements drawn from the fields of economics,
astronomy, anthropometry, ballistics, and pure
chance. Certain of the common characteristics may be noted. 

General Characteristics of Frequency Distributions

There is, in the first place, variation in the values of the 
measurements secured. Human heights vary, astronomical 
measurements of the same quantity differ, projectiles fired 
under conditions as nearly constant as it is humanly possible 
to make them fail to land at the same spot, incomes vary 
as between individuals, and exchange rates move from week 
to week and month to month. The various observations or 
values secured in a given case are distributed along a scale, 
between two extreme values. 

The distribution of these values along the scale (the
x-axis) is such that, moving from one extreme value towards
the other, the cases found at successive points along
the scale (the successive class frequencies) increase with 
more or less regularity up to a maximum, and then de- 
crease in much the same way. In spite of variation, there- 
fore, we find a central tendency, a massing of cases at certain 
points on the scale of values. This is the second notable 
characteristic which all the frequency distributions appear 
to possess in common. 

If we measure, for each of the successive classes, the 
amount of deviation along the scale from the point of 
greatest concentration it will be noted that small deviations 
are much more frequent than large ones, that extreme 
deviations are rare, and that deviations on both sides of
the point of concentration reach perfect (or almost perfect) 
equality in the examples taken from the physical sciences 
and from the field of pure chance, and approximate equality 
in the economic distributions. (Exceptions to this rule 
of approximate equality on the two sides of the point of 
greatest concentration are not infrequent, the example of 
income distribution being a rather striking case in point.)

Figure 46 depicts a curve which is termed the “probability
curve,” or the “normal curve of error.” Its characteristics
will be discussed in greater detail in a later section.
At this point it is presented merely as a basic type which
some of the above examples approach closely, and from 
which others of the examples represent more or less pro- 
nounced deviations. Departures from this type, let it be 
emphasized, are numerous and significant, but as a basic 



form this normal curve of error is extremely important in 
statistical work. Even the most important variations from 
this type resemble it with sufficient closeness to justify the 
use of a generalized method of describing frequency distri- 
butions. Distributions of quantitative data vary, and their 
variations from each other and from certain standard types 
are of the greatest significance, but in spite of their varia- 
tions a family resemblance runs through them all. Each 
new frequency distribution is not an isolated phenomenon, 
but a member of a large family, and as such the problem
of describing and analyzing it may be approached with
confidence in methods which have been found applicable in 
other cases. 

Given this more or less common type, how may a given dis- 
tribution be described and differentiated from others? Certain 
methods will have been suggested by the preceding discussion. 

Methods of Describing a Frequency Distribution 

The values of all the observations, it has been noted, are 
spread along a scale. The frequency distribution may be 
described by the selection of a single value on that scale 
which is thoroughly representative of the distribution as a 
whole. Since the frequencies vary, an obvious choice is 
the selection of that value which occurs the greatest number 
of times, or, in other words, that point on the scale at 
which the concentration is greatest. This value consti- 
tutes a measure of the central tendency of the distribution. 
Thus, one might find the income class in which the greatest 
number of people fall, and let the mid-point of that class 
(which is $950 in the distribution presented in Table 11) 
serve as the representative of the distribution. This most 
common value, it should be noted, is only one of several 
possible measures of the central tendency of a given dis- 
tribution. All such measures are termed averages. 

A single representative value such as this has many uses 
but, by itself, it obviously leaves out many facts concern- 
ing the distribution. Of great importance is the character 
of the distribution about the average. Are the values of all
tabulated cases closely concentrated, or is there pronounced 
dispersion over a wide range? The representative character 
of any average depends upon how closely the other values 
cling to it, upon the degree of concentration about the 
central tendency. The average, therefore, must be supplemented
by a measure of variation, a measure of the “scatter”
about the central value.

An adequate description should include also an account
of the degree of symmetry of the distribution. It is highly 
important to know whether there is an equal distribution 
of cases on each side of the point of greatest concentration, 
or whether the frequency curve is skewed to one side, as in 
the case of income distribution illustrated above. If the 
curve is not symmetrical the degree of asymmetry should 
be determined, and for this purpose measures of skewness
have been developed. 

It is, finally, possible to measure the degree of peaked- 
ness of frequency curves, by comparing them with the 
normal curve of error as a standard. It is obvious that 
the frequency polygon representing price changes (Fig. 42)
would, if smoothed, constitute a curve much more peaked 
than the normal curve, and this fact of pronounced con- 
centration at the central value is highly significant. This 
characteristic of frequency curves is called kurtosis, and 
the measurement of kurtosis constitutes the final step in 
the description of the frequency distribution. 

When these various measures have been secured the task 
of statistical analysis will be well under way. The chaotic 
assortment of data with which we started will have been 
reduced to workable form in the shape of a frequency table, 
and the essential facts which the table reveals will have 
been distilled into three or four significant measures. This 
process not only reveals the characteristics of the given 
distribution, but also facilitates comparison with similar 
distributions. For example, it is impossible to compare 
some tens of millions of unorganized personal income figures 
for the United States with similar data for Great Britain. 
But if we secure a value for the average or most repre- 
sentative income for each country, together with a descrip- 
tion of the distribution of personal incomes about that 
central value, a legitimate basis for comparative study is 
obtained. In manipulating and analyzing masses of ma- 
terial, whatever the purpose of study may be, full use 
should be made of the power to condense, simplify and
compare which is given by the measures employed in de- 
scribing the frequency distribution. 

The succeeding section is devoted to a discussion of one 
phase of this descriptive process, that concerned with the 
measurement of central tendencies. After the development 
of this subject of averages, problems relating to measures 
of variation and of skewness will be dealt with. 

Averages 

We have seen that the representation of a frequency 
distribution by an average, a single typical figure, is justi- 
fied because of the tendency of large masses of figures to 
cluster about a central value, from which the values of all 
observed cases depart with more or less regularity and 
smoothness. It is solely because of the concentration of 
cases about a central point on the scale that such repre- 
sentative figures have significance. The average represents 
the distribution as a whole only because it is a typical 
value. If the individual items entering into a distribution 
vary widely in value and show no tendency toward con- 
centration, no single value can represent them. Thus the 
arithmetic mean of the three numbers 3, 125, 1,000 is 376, 
but 376 in no way represents the three values on which it 
is based. This fundamental requirement, that there be a 
tendency toward concentration about a central value, must 
be met if an average is to be at all representative. 
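
The point may be checked numerically. The short sketch below is an illustration added here, with variable names assumed for the purpose: it computes the mean of the three numbers cited and measures how far each of them lies from that mean.

```python
# The three widely scattered values cited in the text.
values = [3, 125, 1000]

# Arithmetic mean: the sum of the values divided by their number.
mean = sum(values) / len(values)

# Distance of each value from the mean. When even the nearest value
# lies far from the mean, the mean is a poor representative.
distances = [abs(v - mean) for v in values]

print("mean:", mean)
print("distances from the mean:", distances)
```

No one of the three values lies within 250 units of the computed mean, which is the sense in which the figure fails to be typical.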

If the general character of a frequency distribution be 
recalled the logic of one sort of average will be clear at 
once. It was suggested above that that point on the x-scale
at which the concentration is greatest, that value which 
occurs the greatest number of times, might be taken as 
typical of the entire distribution. This value is termed 
the mode, and the group in which it falls is called the modal 
group. If a frequency curve be drawn to represent a given 
distribution, the mode will be the x-value corresponding to 
the maximum ordinate.^ The maximum ordinate itself measures
the frequency of the modal group. Students frequently
confuse these two values in determining the mode. It is
not the distance along the y-scale but the distance along
the x-scale which measures the value of the mode. The
ordinates merely measure the number of cases falling in the 
several classes, not the values of the cases falling in those 
classes. 
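
The distinction between the two values may be made concrete with the height data of Table 16. The following is a rough sketch only: it locates the modal group of the tabulated distribution (not the mode of a fitted ideal curve), and the variable names are assumptions made for illustration.

```python
# Lower class limits (inches) and class frequencies from Table 16.
heights = list(range(60, 74))          # 60+, 61+, ..., 73+
soldiers = [197, 317, 692, 1289, 1961, 2613, 2974,
            3017, 2287, 1599, 878, 520, 262, 174]

# The maximum ordinate is a frequency (a y-value).
max_ordinate = max(soldiers)

# The modal group is the class on the x-scale at which that
# maximum ordinate stands.
modal_group = heights[soldiers.index(max_ordinate)]

print("frequency of the modal group (y-value):", max_ordinate)
print("modal group, lower limit in inches (x-value):", modal_group)
```

The first figure printed is a count of soldiers; only the second is a height, and only the second bears upon the value of the mode.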

As typical of a given distribution we might also select 
that point on the scale of x-values on each side of which
one half the total number of cases fall. This value, which
is called the median, is that which exceeds the values of 
one half the cases included, and is in turn exceeded by the 
values of one half the cases. Thus it has been estimated 
that in 1918 the median value of personal incomes in the 
United States was $1,140; one half of the 37 million recipi- 
ents of personal incomes received less than this sum, while 
one half received more. When a distribution is represented 
by a frequency curve, the area under the curve is divided 
into two equal parts by an ordinate erected at that point 
on the x-axis corresponding to the median value. This
follows, of course, from the definition of the median, and 
from the fact that the area under a frequency curve repre- 
sents the total number of cases included in the distribution. 
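
As a hedged illustration of this definition, the class within which the median falls may be located by cumulating the frequencies until one half of the total has been reached. The sketch below uses the height data of Table 16. It finds only the class containing the median and does not attempt to fix the median's position within that class; the names used are assumed for illustration.

```python
from itertools import accumulate

# Lower class limits (inches) and class frequencies from Table 16.
heights = list(range(60, 74))
soldiers = [197, 317, 692, 1289, 1961, 2613, 2974,
            3017, 2287, 1599, 878, 520, 262, 174]

# One half of the total number of cases.
half = sum(soldiers) / 2

# Step along the scale, cumulating frequencies, and stop at the
# first class whose running total reaches one half of the cases.
median_class = None
for lower_limit, running_total in zip(heights, accumulate(soldiers)):
    if running_total >= half:
        median_class = lower_limit
        break

print("one half of the cases:", half)
print("class containing the median (lower limit, inches):", median_class)
```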

The arithmetic mean is a third type of average which 
may be used to represent a distribution. This is a calcu- 
lated average, affected by the value of every item in the 
distribution. Herein, obviously, it differs from the mode 
and the median, which depend primarily upon the relative 
position of the items in the frequency table, and are not 
affected by the values of all individual items. The arith- 
metic mean is the center of gravity of a distribution; it 
would be the x-value of the point of balance of a frequency 

^ Strictly speaking, the mode is the x-value corresponding to the maximum
ordinate of the ideal frequency curve which has been fitted to the given
distribution.

curve, if the curve could be blocked out and manipulated
in solid form. 

The geometric mean and the harmonic mean are two other 
averages the characteristics of which will be discussed at a 
later point. 

The computation or location of these various averages 
may involve somewhat lengthy processes if the number of 
cases included be great. If appropriate methods be em- 
ployed, however, the labor of computation may be materi- 
ally cut down. The use of the following symbols will sim- 
plify the explanation of these methods: 

M: Arithmetic mean.

Mo: Mode.

Md: Median.

m: The value of an individual observation; in a frequency
distribution, the value of the midpoint of a class.

f: The number of items (observations) in a given class
in a frequency distribution.

N: The total number of items in a given series or frequency
distribution.

Σ (Sigma): The symbol for the process of summation, meaning
“the sum of.”


THE COMPUTATION OF THE ARITHMETIC MEAN

Using the above notation, the formula for the arithmetic
mean is

\[ M = \frac{\Sigma m}{N}. \]

Thus the mean of the measures 2, 5, 6, 7, is equal to the
sum of these measures divided by 4, which is 20/4 or 5. The

computation of the arithmetic mean when each measure is 
reported at its true value is thus a simple process of sum- 
mation and division. The weekly earnings of 210 factory 
employees were listed in an earlier section. If these figures 
be added, and the total divided by 210, the mean weekly 
wage is found to be $26.983. In this case the task of adding
210 items is somewhat tedious; it is a task which would
become almost impossible if one were dealing with the 
37 million personal income figures, for example. For prac- 
tical reasons, therefore, it is usually necessary to compute 
the required averages from the frequency distribution rather 
than from the original ungrouped data. To exemplify this 
process we may utilize data relating to the weekly earnings 
of steel workers in the Pittsburgh District in 1935. 

The importance of certain of the precautions mentioned in 
the section on classification, in connection with the choice 
of a class-interval, will be clear from this example. When 
the mean of a distribution is calculated from classified observations,
we must assume an even distribution of cases
within each class. The class-interval should be selected 
with this in mind, in order that errors introduced by the 
assumption may be minimized. If the items in each class 
are evenly distributed, the mid-value of each class may be 
taken as representative of all the observations included; 
when such a mid-value is multiplied by the number of items
in the class, the product is approximately equal to the sum
of all the individual items in the class. The formula for the
mean thus becomes \( M = \frac{\Sigma (fm)}{N} \). Table 22 illustrates the

procedure in detail. 
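
The same computation may be sketched in code. The block below is a minimal illustration using the midpoints and frequencies of Table 22; the variable names are assumptions made here, and the result rests on the same assumption as the table, namely that the items in each class are evenly distributed about the mid-value.

```python
# Midpoints m of the 28 classes of Table 22: $2, $6, ..., $110.
midpoints = list(range(2, 111, 4))

# Frequencies f of the corresponding classes (Table 22).
frequencies = [67, 290, 437, 730, 1056, 1009, 712, 609, 334, 187,
               179, 105, 60, 67, 28, 37, 33, 29, 16, 8,
               3, 8, 4, 7, 9, 5, 1, 1]

# N is the total frequency; sum_fm is the sum of the products fm.
N = sum(frequencies)
sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))

# M = sum(fm) / N
M = sum_fm / N

print("N =", N)
print("sum of fm =", sum_fm)
print("M =", round(M, 4))
```

The totals agree with those of Table 22 (6,031 and 145,374), and the quotient with the mean of $24.1045 there obtained.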

The value secured in this way is sometimes called a 
weighted arithmetic mean. What we do, in effect, is to 
secure the arithmetic mean of the 28 figures in the column 
headed m. We do not take a simple average of these figures,
however, but weight each one in proportion to the
number of cases falling in the class-interval of which it is 
the mid-value. It is precisely the procedure we should fol- 
low in calculating the mean of five men’s incomes, two of 
whom, let us say, have incomes of $2,000 and three of whom 
have incomes of $3,000. Clearly it would not do to add the 
figures $2,000 and $3,000, dividing the sum by two. The 
Table 22

Calculation of the Arithmetic Mean of Weekly Earnings of Workers
in Open-Hearth Furnaces in the Pittsburgh District in 1935 ^

Class-interval            Mid-point    Frequency        fm
(in dollars per week)         m            f

$  0-$  3.99                  2            67          134
   4-   7.99                  6           290        1,740
   8-  11.99                 10           437        4,370
  12-  15.99                 14           730       10,220
  16-  19.99                 18         1,056       19,008
  20-  23.99                 22         1,009       22,198
  24-  27.99                 26           712       18,512
  28-  31.99                 30           609       18,270
  32-  35.99                 34           334       11,356
  36-  39.99                 38           187        7,106
  40-  43.99                 42           179        7,518
  44-  47.99                 46           105        4,830
  48-  51.99                 50            60        3,000
  52-  55.99                 54            67        3,618
  56-  59.99                 58            28        1,624
  60-  63.99                 62            37        2,294
  64-  67.99                 66            33        2,178
  68-  71.99                 70            29        2,030
  72-  75.99                 74            16        1,184
  76-  79.99                 78             8          624
  80-  83.99                 82             3          246
  84-  87.99                 86             8          688
  88-  91.99                 90             4          360
  92-  95.99                 94             7          658
  96-  99.99                 98             9          882
 100- 103.99                102             5          510
 104- 107.99                106             1          106
 108- 111.99                110             1          110

 Total                                  6,031      145,374

\[ M = \frac{\Sigma (fm)}{N} = \frac{\$145{,}374}{6{,}031} = \$24.1045. \]


^ These figures and similar data appearing in subsequent tables were compiled
by Edward K. Frazier, of the Division of Wages, Hours and Working
Conditions, U. S. Bureau of Labor Statistics. See “Earnings and Hours in
Blast Furnaces, Bessemer Converters, Open-Hearth Furnaces and Electric
Furnaces, 1933 and 1935,” Monthly Labor Review, April, 1936. The detailed
statistics in Table 22 were provided through the courtesy of Dr. Isador Lubin,
Commissioner of Labor Statistics.
figure $2,000 is given a weight of two, the figure $3,000 is
given a weight of three, and the resultant sum, $13,000, is 
divided by five. Though the procedure in working from the 
frequency distribution is thus a form of weighting, the term 
“weighted average” is coming to have a more restricted 
meaning, to be explained at a later point, and should not 
in general be applied to an average computed from a fre- 
quency distribution. 
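
The example of the five incomes may be set out as a brief sketch, given here for illustration with assumed variable names. It contrasts the simple average of the two distinct figures with the average in which each figure is weighted by the number of men receiving it.

```python
# Two distinct income figures and the number of men receiving each.
incomes = [2000, 3000]
men = [2, 3]

# Simple average of the two figures, ignoring how many men
# receive each: this is the procedure that "would not do."
simple_average = sum(incomes) / len(incomes)

# Each figure weighted by its frequency, the sum divided by
# the total number of men.
weighted_sum = sum(f * m for f, m in zip(men, incomes))
mean_income = weighted_sum / sum(men)

print("simple average of the two figures:", simple_average)
print("weighted sum:", weighted_sum)
print("mean of the five incomes:", mean_income)
```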


SHORT METHOD OF COMPUTING THE ARITHMETIC MEAN 

The calculation of the arithmetic mean from the fre- 
quency table is much easier, in general, than from the un- 
grouped data, but when the number of cases included is 
large even the computation from the frequency table by 
the method illustrated above may be laborious. The pro- 
cedure may be greatly simplified. 

From the method of computing the arithmetic mean it 
follows that the algebraic sum of the deviations of a series 
of individual magnitudes from their mean is zero. This 
may be readily demonstrated. We may represent the series 
of magnitudes by \(m_1, m_2, m_3, \dots, m_n\), their arithmetic
mean by M, and the deviations of the various magnitudes
from the mean by \(d_1, d_2, d_3, \dots, d_n\).

Then

\[ \frac{m_1 + m_2 + m_3 + \dots + m_n}{N} = M \tag{1} \]

and

\[ m_1 + m_2 + m_3 + \dots + m_n = NM. \tag{2} \]

The number of terms, of course, is equal to N. Therefore,
subtracting M N times from each side of the equation,

\[ (m_1 - M) + (m_2 - M) + (m_3 - M) + \dots + (m_n - M) = 0. \tag{3} \]

But \(m_1 - M = d_1\), \(m_2 - M = d_2\), etc., and equation (3) may be written

\[ \Sigma d = 0. \]

Knowing this to be true we may measure the deviations 
of a series of magnitudes from any arbitrary quantity, 
secure the algebraic sum of the deviations, and from this
value ascertain the difference between the arbitrary quantity
and the true mean. For this difference will be the
mean of the deviations from the arbitrary origin. If we
let M' represent the arbitrary origin, or assumed mean,
while \(c = M - M'\), and \(d_1', d_2', d_3', \dots, d_n'\) represent the
deviations of the various magnitudes from M' (i.e.,
\(d_1' = m_1 - M'\), \(d_2' = m_2 - M'\), etc.), then

\[ d_1' = d_1 + c, \quad d_2' = d_2 + c, \quad d_3' = d_3 + c, \quad \dots, \quad d_n' = d_n + c \]

and

\[ \Sigma d' = \Sigma d + Nc. \]

But

\[ \Sigma d = 0, \qquad \therefore \; \Sigma d' = Nc \]

and

\[ c = \frac{\Sigma d'}{N}. \]

From the known values of M' and c the value of the true
mean may be obtained, for M = M' + c. The procedure is
illustrated in the following simple example:

Table 23

Computation of the Arithmetic Mean (Short Method)
(Ungrouped data)

    m        f        d'

    5        1       - 15          M' = 20
   15        1       -  5
   25        1       +  5          c = Σd'/N = + 25/5 = + 5
   35        1       + 15
   45        1       + 25          M = M' + c = 20 + 5 = 25

 Total       5       + 25


When the deviations are measured from 20 as arbitrary 
origin there is in each case a constant error, if the devia- 
tion from the true mean be taken as standard. This error 
is equal to the difference between the true and the assumed 
means. The algebraic sum of the deviations from the 
assumed mean will equal N times this constant error, since
the error is repeated once for every item included. By
dividing the sum of these deviations by N the amount of 
the error may be determined and the value of the mean 
thus obtained. 
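
The reasoning may be verified on the five items of Table 23. The sketch below is illustrative only, with names assumed for the purpose: it first confirms that the deviations from the true mean sum to zero, and then recovers the true mean from deviations measured about the arbitrary origin 20.

```python
# The five ungrouped items of Table 23.
items = [5, 15, 25, 35, 45]
N = len(items)

# Direct computation of the true mean, for comparison.
true_mean = sum(items) / N

# Deviations from the true mean: their algebraic sum is zero.
sum_d = sum(m - true_mean for m in items)

# Short method: deviations d' from an assumed mean M'.
assumed_mean = 20
sum_d_prime = sum(m - assumed_mean for m in items)

# The constant error c is the mean of the deviations d';
# adding it to the assumed mean gives the true mean.
c = sum_d_prime / N
short_method_mean = assumed_mean + c

print("sum of deviations from the true mean:", sum_d)
print("sum of deviations from the assumed mean:", sum_d_prime)
print("c =", c, " M =", short_method_mean)
```

Any other assumed mean would serve equally well; only the size of c would change, the final value of M remaining the same.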

Table 24

Calculation of the Arithmetic Mean of Weekly Earnings of Workers
in Open-Hearth Furnaces in the Pittsburgh District in 1935
(Short method)

M' = $30

Class-interval          Mid-point  Frequency   d' (in class-         fd'
(in dollars per week)       m          f       interval units)    -        +

$  0-$  3.99                2          67          -  7           469
   4-   7.99                6         290          -  6         1,740
   8-  11.99               10         437          -  5         2,185
  12-  15.99               14         730          -  4         2,920
  16-  19.99               18       1,056          -  3         3,168
  20-  23.99               22       1,009          -  2         2,018
  24-  27.99               26         712          -  1           712
  28-  31.99               30         609             0
  32-  35.99               34         334          +  1                    334
  36-  39.99               38         187          +  2                    374
  40-  43.99               42         179          +  3                    537
  44-  47.99               46         105          +  4                    420
  48-  51.99               50          60          +  5                    300
  52-  55.99               54          67          +  6                    402
  56-  59.99               58          28          +  7                    196
  60-  63.99               62          37          +  8                    296
  64-  67.99               66          33          +  9                    297
  68-  71.99               70          29          + 10                    290
  72-  75.99               74          16          + 11                    176
  76-  79.99               78           8          + 12                     96
  80-  83.99               82           3          + 13                     39
  84-  87.99               86           8          + 14                    112
  88-  91.99               90           4          + 15                     60
  92-  95.99               94           7          + 16                    112
  96-  99.99               98           9          + 17                    153
 100- 103.99              102           5          + 18                     90
 104- 107.99              106           1          + 19                     19
 108- 111.99              110           1          + 20                     20

 Total                              6,031                    - 13,212  + 4,323

Calculations

1. Algebraic sum of deviations from M'
       - 13,212 + 4,323 = - 8,889

2. Calculation of c (in class-interval units)
       c = - 8,889 / 6,031 = - 1.47388

3. Reduction of c to original units
       Class-interval = $4
       c (in original units) = - 1.47388 × $4 = - $5.8955

4. Determination of M
       M = M' + c
       M = $30 - $5.8955 = $24.1045

The work of computation may be still further abbrevi- 
ated, for observations arranged in the form of a frequency 
distribution, by measuring the deviations in terms of the 
class-interval as a unit. Then, in finally applying the neces- 
sary correction, the difference between the true and assumed 
means may be again expressed in terms of the original units.
The method may be illustrated in detail with reference to
the wage data for which the mean has already been calcu- 
lated. 

The steps in this process of calculating the arithmetic 
mean by the short method may be briefly summarized : 

1. Organize the data in the form of a frequency distribution.

2. Adopt as the assumed mean the midpoint of a class near the
   center of the distribution.

3. Arrange a column showing the deviation (d') from the assumed
   mean of the items in each class, in terms of class-interval
   units. This deviation will be zero for the items in the class
   containing the assumed mean, - 1 for the items in the next
   lower class, + 1 for the items in the next higher class, and so
   on.

4. Multiply the deviation of each class by the frequency of that
   class, taking account of signs. These products are entered
   in the column fd'.

5. Get the algebraic sum of the items entered in the column fd'.

6. Divide this sum by the total frequency (N). The quotient is
   the correction (c) in class-interval units.

7. Multiply the correction (c) by an amount equal to the
   class-interval. The product is the correction in terms of the
   original units.

8. Add this correction (algebraically) to the assumed mean (M');
   the sum is the true mean (M).
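
The eight steps may be sketched in Python as follows, using the mid-points and frequencies of Table 24. The function name grouped_short_mean and its arguments are illustrative assumptions, not part of the original text.

```python
def grouped_short_mean(midpoints, freqs, assumed_mean, interval):
    """Short-method mean of a frequency distribution with equal class-intervals."""
    # Step 3: deviations d' from the assumed mean, in class-interval units.
    steps = [round((m - assumed_mean) / interval) for m in midpoints]
    # Steps 4 and 5: algebraic sum of the products f * d'.
    sum_fd = sum(f * d for f, d in zip(freqs, steps))
    # Step 6: correction c in class-interval units.
    c_units = sum_fd / sum(freqs)
    # Steps 7 and 8: convert c to original units and add to the assumed mean.
    return assumed_mean + c_units * interval


midpoints = list(range(2, 111, 4))  # 2, 6, 10, ..., 110 (dollars per week)
freqs = [67, 290, 437, 730, 1056, 1009, 712, 609, 334, 187, 179, 105, 60, 67,
         28, 37, 33, 29, 16, 8, 3, 8, 4, 7, 9, 5, 1, 1]

mean = grouped_short_mean(midpoints, freqs, assumed_mean=30, interval=4)
print(round(mean, 4))  # agrees with the $24.1045 of Table 24
```

The result is the same as that given by the long method (the sum of the products fm divided by N); the short method only reduces the size of the numbers handled.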

Location of the Median 

UNGROUPED DATA

The median is a value of a variable so selected that 
50 per cent of the total number of cases, when arranged in
order of magnitude, lie below it and 50 per cent above it. 
For many frequency distributions this is a useful and sig- 
nificant value. 

When handling data which are not arranged in the form
of a frequency distribution the location of the median is a
simple matter. The data having been arranged in order of
magnitude, it is only necessary to count from one end
until that point on the scale of values is found which divides
the number of cases into two equal parts. As a simple
example we may assume that the following seven figures 
represent the annual incomes of seven individuals : 

$750     $975     $1,128     $1,450     $1,475     $1,825     $1,950

The scale of values extends from $750 to $1,950, and
seven items are arranged along this scale. The value of
$1,000 has two items on one side and five items on the
other, so obviously does not conform to our definition of
the median. The value of $1,450, which corresponds with
the income of one of the seven individuals, is the median
in this case. Three items lie on each side of this value; or,
if we assume the central item to be cut in two, 3½ items lie
on each side of this point. This case is illustrated in Fig. 46.
This diagram may help to bring out the fact that the
median is a point on a scale so located that it cuts the
frequencies in two.

Fig. 46. Illustrating the Location of the Median with Ungrouped Data
(Personal incomes of seven individuals)

The problem is slightly different when an even number of
cases is included. This condition is exemplified in Table 25,
which shows the average earnings per man-hour
in each of 38 selected industries during the year 1933.

Table 25

Average Earnings per Man-Hour in Selected Manufacturing
Industries *

                                                  Average wage
Industry                                          per man-hour

Silk and rayon goods: Commission throwing            $.278
Cotton goods                                          .279
Cigars                                                .299
Silk and rayon goods: Commission weaving              .313
Silk and rayon goods: Regular throwing                .316
Knit underwear                                        .319
Knit outerwear                                        .358
Cigarettes                                            .361
Silk and rayon goods: Regular weaving                 .369
Wool shoddy                                           .370
Hosiery                                               .372
Cotton small wares                                    .378
Woolen goods                                          .395
Sugar, beet                                           .395
Worsted goods                                         .399
Snuff, and chewing and smoking tobacco                .402
Knit cloth                                            .414
Rayon yarns                                           .421
Feeds, prepared                                       .425
Meat packing                                          .426
Pulp                                                  .431
Ice, manufactured                                     .436
Flour milling                                         .444
Paper                                                 .445
Carpets and rugs, wool                                .464
Leather tanning                                       .470
Sugar refining, cane                                  .481
Soap                                                  .482
Blast furnaces                                        .488
Felt goods                                            .488
Cereal preparations                                   .510
Steel works                                           .519
Motor vehicle bodies and parts                        .561
Machine tools                                         .585
Motor vehicles                                        .610
Machine-tool accessories                              .621
Petroleum refining                                    .643
Malt                                                  .657

* From Monthly Labor Review, October, 1935, 910.

In this case the median must be a value on each side of
which 19 industries lie. Therefore any value exceeding
$0.425 (average earnings in the prepared feed industry)
and less than $0.426 (average earnings in the meat packing
industry) will satisfy the definition of a median. Under
these conditions, where the median is really indeterminate,
a value half-way between the two limiting values is accepted,
by convention. The median of the 38 figures would thus
be $0.4255.

In this example the median value does not correspond 
with the earnings in any one industry. This will frequently 
be so when there is an even number of observations. 
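
Both cases, the odd and the even number of observations, may be sketched in Python. The function name median_ungrouped is an illustrative assumption; the figures are the seven incomes given above and the 38 wage figures of Table 25.

```python
def median_ungrouped(values):
    """Median of ungrouped data; with an even count, take the half-way value."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # the central item itself
    return (ordered[mid - 1] + ordered[mid]) / 2  # conventional half-way point


incomes = [750, 975, 1128, 1450, 1475, 1825, 1950]
wages = [.278, .279, .299, .313, .316, .319, .358, .361, .369, .370,
         .372, .378, .395, .395, .399, .402, .414, .421, .425, .426,
         .431, .436, .444, .445, .464, .470, .481, .482, .488, .488,
         .510, .519, .561, .585, .610, .621, .643, .657]

print(median_ungrouped(incomes))          # 1450, the fourth of the seven items
print(round(median_ungrouped(wages), 4))  # 0.4255, half-way between .425 and .426
```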


GROUPED DATA 

The task of locating the median is essentially the same 
when the data are in the form of a frequency distribution. 
The fact that the real values of the individual items are 
not known, because of the grouping by classes, complicates 
the problem slightly. The data in Table 26, relating to 
advertising rates of daily newspapers in the United States, 
may be used in illustrating the method. 


Table 26

Location of Median, Newspaper Advertising Rates in 1933
Minimum Line Rates for National Advertising, 245 Daily Newspapers
in Cities of 25,000 to 50,000 Population *

Class-interval          No. of newspapers
Rate per line           charging stated rate
(in cents)                      f

  1.0 -  2.99                   6
  3.0 -  4.99                  53
  5.0 -  6.99                  85
  7.0 -  8.99                  56
  9.0 - 10.99                  21
 11.0 - 12.99                  16
 13.0 - 14.99                   4
 15.0 - 16.99                   4

 Total                        245

* Source: Editor and Publisher, International Yearbook for 1933.

In the present case the location of the median involves
the determination of that value on each side of which 122.5
items lie. We may assume that we start at the lower end
of the scale and move through the successive classes. When
we reach the upper limit of the first class (that including
items having values from 1.0 to 3.0) we have left behind
us 6 cases, while 239 lie in front of us. When the upper
limit of the second class is attained, 59 items have been
passed. The upper limit of the third class has below it
144 items. Somewhere between the lower and upper limits
of the third class lies the desired point, that which has
122.5 items on each side of it. How far must we move
through this class, from 5.0 to 7.0 in order to reach this
point?

It will be recalled that, for purposes of calculation, the
assumption is made that there is a uniform distribution of
the items lying within any given class. Since before we
reach the third class 59 cases have been counted, only 63.5
of the 85 included in this class are needed to complete the
desired number, 122.5. On the assumption of even distribution
the required 63.5 cases will lie within a distance
on the scale equal to \( \frac{63.5}{85} \) of the class-interval. The
class-interval is 2.0; \( \frac{63.5}{85} \) of 2.0 is equal to 1.49. As we move
up the scale, then, having reached 5.0, we proceed an additional
distance equal to 1.49. At a point on the scale having
a value of 6.49 is the dividing line on each side of which
lie 122.5 cases. This is the value of the median.

The process of computation is shown at the right of the 
frequency table. The following is a summary of the steps 
involved in the location of the median : 

1. Arrange the data in the form of a frequency distribution.

2. Divide the total number of measures by 2; this gives the
   number which must lie on each side of the point to be located.

3. Begin at the lower end of the scale and add together the
   frequencies in the successive classes until the lower limit of the
   class containing the median value is reached.

4. Determine the number of measures from this class which must
   be added to the frequencies already totaled to give a number
   equal to N/2.

5. Divide the additional number thus required by the total
   number of cases in the class containing the median. This
   indicates the fractional part of the class-interval within which
   the required cases lie.

6. Multiply the class-interval by the fraction thus set up.

7. To the lower limit of the interval containing the median add
   the result of the multiplication process indicated in (6).
   This gives the value of the median.

The last three steps constitute merely a simple form of 
interpolation. 

The entire process may be reversed by beginning at the 
upper end of the scale and counting downwards. In this 
case the final operation is one of subtraction from the upper 
limit of the interval containing the median. 

N/2 may be a fractional value, as in the example given, 
or a whole number. The operation is precisely the same in 
the two cases. 
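
The interpolation, counted from either end of the scale, may be sketched in Python with the data of Table 26. The function names median_from_below and median_from_above are illustrative assumptions; equal class-intervals and the exact class limits 1.0, 3.0, 5.0, and so on are assumed.

```python
def median_from_below(lower_limits, freqs, interval):
    """Count upward from the lower end and interpolate within the median class."""
    half = sum(freqs) / 2
    passed = 0
    for lower, f in zip(lower_limits, freqs):
        if passed + f >= half:
            return lower + (half - passed) / f * interval
        passed += f


def median_from_above(lower_limits, freqs, interval):
    """Count downward from the upper end; the final step is a subtraction."""
    half = sum(freqs) / 2
    passed = 0
    for lower, f in zip(reversed(lower_limits), reversed(freqs)):
        if passed + f >= half:
            upper = lower + interval
            return upper - (half - passed) / f * interval
        passed += f


lower_limits = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]
freqs = [6, 53, 85, 56, 21, 16, 4, 4]

print(round(median_from_below(lower_limits, freqs, 2.0), 2))  # 5.0 + 63.5/85 * 2.0 = 6.49
print(round(median_from_above(lower_limits, freqs, 2.0), 2))  # 7.0 - 21.5/85 * 2.0 = 6.49
```

Counting from above, 101 cases are passed before the median class is reached, so that 21.5 of its 85 cases are required; the same point on the scale is reached from either direction.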

Quartiles and Deciles

For many purposes it is desirable to locate on the scale 
of values, along which the items constituting a frequency 
distribution are ranged, points dividing the total number 
of measures in other ways. Similar to the median, which 
divides the total number of cases into two equal groups, 
are the quartiles, deciles, and percentiles. The quartiles, 
as the term implies, are points on the scale which divide 
the entire number of measures into four equal groups, the 
deciles divide the number into ten equal groups, and the 
percentiles divide the total number of cases into 100 equal 
groups. Thus the first quartile is that point on the scale 
below which one quarter of the total number of cases lie 
and above which three quarters of the total number of 
cases lie. The second quartile and the median are identical 
values. The third decile is that point on the scale below
which three tenths of the total number of cases lie and
above which seven tenths of the total number of cases lie. 
In all cases the count begins at the lower end of the scale. 

Example: Location of the First Quartile (\( Q_1 \)), Newspaper Advertising Rates
(See Table 26)

\[ N/4 = 61.25 \]

\[ Q_1 = 5.0 + \left( \frac{2.25}{85} \times 2.0 \right) = 5.05 \]

Example: Location of the Eighth Decile (\( D_8 \)), Newspaper Advertising Rates
(See Table 26)

\[ N/10 = 24.5, \qquad 8N/10 = 196 \]

\[ D_8 = 7.0 + \left( \frac{52}{56} \times 2.0 \right) = 8.86 \]

A method of locating median, quartiles, deciles and per- 
centiles graphically is explained below. 
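
The same interpolation serves for any quartile, decile, or percentile. A minimal Python sketch follows; the function name grouped_quantile and its arguments k and parts are illustrative assumptions, and equal class-intervals are assumed as in Table 26.

```python
def grouped_quantile(lower_limits, freqs, interval, k, parts):
    """Point below which k/parts of the cases lie (k=1, parts=4 gives Q1)."""
    target = k * sum(freqs) / parts      # e.g. N/4 for the first quartile
    passed = 0
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and passed + f >= target:
            return lower + (target - passed) / f * interval
        passed += f
    raise ValueError("k/parts must lie between 0 and 1")


lower_limits = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]
freqs = [6, 53, 85, 56, 21, 16, 4, 4]

print(round(grouped_quantile(lower_limits, freqs, 2.0, 1, 4), 2))   # Q1 = 5.05
print(round(grouped_quantile(lower_limits, freqs, 2.0, 8, 10), 2))  # D8 = 8.86
print(round(grouped_quantile(lower_limits, freqs, 2.0, 2, 4), 2))   # second quartile = median = 6.49
```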

Location of the Mode 

The mode is the value of the x-variable corresponding to
the maximum ordinate of a given frequency curve. The
concept of a modal value is a thoroughly easy one to grasp.
It is the most common wage, the most common income,
the most common height. It is the point where the concentration
is greatest, a characteristic which is effectively
brought out by Fechner's term for this average, dichtester
Wert, or thickest value. It is not so easy, however, to locate
the true modal value in a given case. In general statistical
work an approximate value only is secured for the mode,
but for most practical purposes this value is usually sufficiently
accurate.*

* A method of locating the mode more accurately is explained in a later
section.

The method of determining this approximate modal value
may be illustrated by reference to the distribution shown
in Table 27.

Table 27

Frequency Distribution of Five Per Cent Bonds

(This table is based upon quotations on the New York Stock Exchange on
June 13, 1936, on railroad and industrial bonds with coupon rate of 5 per cent)

Quoted price          Mid-point     Frequency
Class-interval            m             f

Less than 40                            22
 40- 49.9                45              5
 50- 59.9                55              5
 60- 69.9                65              3
 70- 79.9                75              8
 80- 89.9                85              9
 90- 99.9                95             19
100-109.9               105             49
110-119.9               115             10
120-129.9               125              3
130-139.9               135              1

Total                                  134

There is wide dispersion of the 22 cases falling below 40,
and the existence of this "open-end" class makes it impossible
to compute the mean, as the table stands. The mode
is therefore an appropriate average to employ in the present
instance.

The class having limits of 100-109.9 contains the greatest 
number of cases. This appears to be the modal group, and 
the mid-point of this class, 105, may be tentatively accepted 
as the value of the approximate mode. But with different 
classifications quite different values might be secured for 
the mode. When the original bond quotations are tabulated 
with varying class-intervals the following results are secured. 
(Only the frequencies of the central classes are shown. It 
is not necessary, for this purpose, to present each of the 
tables as a whole.) 


(a) Class-interval = 5

Class-interval            f
  80- 84.9                3
  85- 89.9                6
  90- 94.9               10
  95- 99.9                9
 100-104.9               29
 105-109.9               20
 110-114.9                7
 115-119.9                3

(b) Class-interval = 2.5

Class-interval            f
  90.0- 92.49             4
  92.5- 94.99             6
  95.0- 97.49             2
  97.5- 99.99             7
 100.0-102.49             9
 102.5-104.99            20
 105.0-107.49            13
 107.5-109.99             7

(c) Class-interval = 2.5

Class-interval            f
  98.75-101.249           6
 101.25-103.749          17
 103.75-106.249          20
 106.25-108.749           8

(d) Class-interval = 1

Class-interval            f
 100-100.9                1
 101-101.9                2
 102-102.9                9
 103-103.9               10
 104-104.9                7
 105-105.9                6
 106-106.9                5
 107-107.9                4

With a class-interval of 5 a value of 102.5 is secured for
the mode; with a class-interval of 2.5 a value of 103.75 is
obtained. A class-interval of 2.5, again, but with different
class limits, yields a mode of 105. Finally, a class-interval
of 1 gives a mode of 103.5. Further changes in classification
would give still other values. The mode thus appears to be 
a curiously intangible and shifting average. Its value, for 
the same data, seems to vary with changes in the size of the 
class-interval and in the location of the class-limits. 
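
The shifting of the approximate mode may be seen in a short Python sketch, in which each tabulation is entered as a list of (lower limit, upper limit, frequency) and the mid-point of the fullest class is taken. The function name crude_mode is an illustrative assumption; the frequency 7 for the class 97.5-99.99 is inferred from tabulation (a), the figure being illegible in the source.

```python
def crude_mode(classes):
    """Mid-point of the class of greatest frequency."""
    lower, upper, _ = max(classes, key=lambda c: c[2])
    return (lower + upper) / 2


tabulations = {
    "a": [(80, 85, 3), (85, 90, 6), (90, 95, 10), (95, 100, 9),
          (100, 105, 29), (105, 110, 20), (110, 115, 7), (115, 120, 3)],
    "b": [(90.0, 92.5, 4), (92.5, 95.0, 6), (95.0, 97.5, 2), (97.5, 100.0, 7),
          (100.0, 102.5, 9), (102.5, 105.0, 20), (105.0, 107.5, 13), (107.5, 110.0, 7)],
    "c": [(98.75, 101.25, 6), (101.25, 103.75, 17),
          (103.75, 106.25, 20), (106.25, 108.75, 8)],
    "d": [(100, 101, 1), (101, 102, 2), (102, 103, 9), (103, 104, 10),
          (104, 105, 7), (105, 106, 6), (106, 107, 5), (107, 108, 4)],
}

modes = {name: crude_mode(classes) for name, classes in tabulations.items()}
print(modes)  # {'a': 102.5, 'b': 103.75, 'c': 105.0, 'd': 103.5}
```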

These difficulties arise primarily from limitations to the
size of the sample being studied. The true mode, that
value which would occur the greatest number of times in
an infinitely large sample, could be located exactly if we
could increase indefinitely the number of cases included.
For, given sufficient cases, the approximate mode approaches
the true mode as the class-interval decreases. Grouping in
large classes obscures details, and as these classes are reduced
in size more of the details are seen and a truer picture
of the actual distribution is secured. But since most practical
work is necessarily based upon relatively small samples,
the increase in the number of classes reveals gaps and
irregularities, and causes such a loss of symmetry and order
that doubt arises as to where the point of greatest concentration
really lies. The different tabulations of bond prices
furnish an excellent example of this.

By mathematical methods it is possible to obtain a value
for the true mode without securing an infinite number of
cases. The smoothing process has been briefly explained.
One sort of smoothing involves the fitting of an appropriate
type of ideal frequency curve to the data of a given
frequency distribution. This gives, theoretically, the distribution
which would be secured by the process first indicated,
that of decreasing indefinitely the size of the class-interval
and increasing indefinitely the number of cases.
The value of the x-variable corresponding to the maximum
ordinate of this ideal fitted curve is the true mode.

For most practical purposes approximate values of the mode
are adequate, and these may be secured by much simpler 
methods. A first and rough approximation may be obtained 
by taking the mid-value of the class of greatest frequency, a 
method suggested above. If the general rules for classifica- 
tion which were outlined in an earlier section have been fol- 
lowed, this procedure will not generally involve a gross error. 

It is possible, given a fairly regular distribution, to secure, 
by a process of interpolation within the modal group, a 
closer approximation than is obtained by accepting the mid- 
value of this group as the mode. Referring again to the 
tabulation of bond prices in Table 27 it will be noted that 
the distribution on the two sides of the modal class is not 
symmetrical. The modal class is that with a mid-value of
105. The class next below, with a mid-value of 95, contains
19 cases, while that next above, with a mid-value of 115,
contains but 10 cases. The disproportion is continued in 
the succeeding classes below and above, more cases being 
bulked below the modal class than above. For other pur- 
poses we have assumed an even distribution of cases between 
the upper and lower limits of each class, but it is probable 
that this is not true of the modal class in the present case. 
Judging from the distribution outside this class, it is likely 
that the concentration is greater in the lower half of the 
class-interval, that is, between 100 and 105. The mode, 
therefore, probably lies below the mid-value 105, rather 
than precisely at that point. We may attempt to locate it 
within the group by weighting, assuming a pull toward the 
lower end of the scale equal to 19 (the number in the class 
next below) and a pull toward the upper end of the scale 
equal to 10 (the number in the class next above). This may 
be expressed by a formula, employing the following symbols: 

\( l \) = lower limit of modal class.

\( f_1 \) = frequency of class next below modal class in value.

\( f_2 \) = frequency of class next above modal class in value.

\( i \) = class-interval.

The interpolation formula is

\[ \mathrm{Mo} = l + \frac{f_2}{f_2 + f_1} \times i. \]

Applying this formula to the bond price data presented in
Table 27, we have

\[ \mathrm{Mo} = 100 + \left( \frac{10}{10 + 19} \times 10 \right) = 100 + 3.45 = 103.45. \]

A closer approximation may sometimes be secured by basing
the weights (represented by \( f_1 \) and \( f_2 \)) upon the total
frequencies of the two or three classes next above the
modal class and the same number below. If three classes
on each side are included in the present case, a value of
102.8 is secured for the mode of bond prices.
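
The weighting within the modal group may be sketched in Python. The function name interpolated_mode and its argument names are illustrative assumptions; the frequencies are those of Table 27.

```python
def interpolated_mode(lower_limit, weight_below, weight_above, interval):
    """Mode located within the modal class: l + f_above / (f_above + f_below) * i."""
    return lower_limit + weight_above / (weight_above + weight_below) * interval


# One class on each side of the modal class 100-109.9 (frequencies 19 and 10).
print(round(interpolated_mode(100, 19, 10, 10), 2))   # 103.45

# Three classes on each side: 19 + 9 + 8 = 36 below, 10 + 3 + 1 = 14 above.
print(round(interpolated_mode(100, 19 + 9 + 8, 10 + 3 + 1, 10), 1))  # 102.8
```

The heavier the frequencies below the modal class, the nearer the result falls to the lower limit; with equal weights on the two sides the mid-value of the class is returned.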

In some cases the problem of locating the mode is com- 
plicated by the existence of several points of concentration, 
rather than the single point which has been assumed in 
the preceding explanation. Thus in Table 9, representing 
the distribution of wages, with a class-interval of 25 cents, 
there are two definite modal points. A distribution of this 
type is called bi-modal; when plotted, a frequency curve
having two humps is obtained. If the data are homogene- 
ous such a distribution is the result of paucity of data and 
of the method of classification employed. It may be due 
to the use of a class-interval too small, with respect to the 
number of cases included in the sample. An approximate 
mode may be determined in such cases by shifting the 
class-limits and increasing the class-interval, carrying on 
this process until one modal group is definitely established. 
This reverses the process by which the true mode may be 
located when the number of cases is infinitely large. Under 
such conditions the class-interval might be reduced until 
it was infinitely small. But with a limited number of cases 
the location of the point where the concentration is greatest 
necessitates increasing the size of the class-interval, in order
to get away from the irregularities due to the smallness of
the sample. 

If the distribution remains bi-modal in spite of changes 
in the class-intervals and class-limits, it is probable that 
the data are not homogeneous, that two different distri- 
butions have by mistake been combined. Such cases are 
not uncommon in biometrical work. The existence of two 
distinct animal species where only one was suspected has 
been revealed in this way. The whole significance of a 
frequency distribution will be lost if the data are not homo- 
geneous, a fact which is as true of work in the field of eco- 
nomic statistics as in any other. 

DETERMINATION OF THE MODAL VALUE FROM MEAN
AND MEDIAN

Another method of securing an approximate value for 
the mode, a method based upon the relationship between 
the values of the mean, median and mode, may be em- 
ployed in certain cases. In a perfectly symmetrical distri- 
bution mean, median and mode coincide. As the distribu- 
tion departs from symmetry these three points on the scale 
are pulled apart. If the degree of asymmetry is only mod- 
erate the three points have a fairly constant relation. The 
mode and mean lie farthest apart, with the median one 
third of the distance from the mean towards the mode. If 
the asymmetry is marked, no such relationship may pre- 
vail. Having the values of any two of the averages in a 
moderately asymmetrical frequency distribution, therefore, 
the other may be approximated. In fact, however, the 
method should only be employed in determining the value 
of the mode, as the other two values may be computed 
more accurately by other methods. The value of the mode 
itself should only be determined in this way when more 
exact methods are not applicable or are not called for. 

The following formula is based upon this relationship : 

Mo = Mean - 3(Mean - Md).

Applying this formula to the telephone pole data shown
in Table 12, the following result is secured: 

Mo = 9.33 - 3(9.33 - 9.015) = 8.385. 

This value is slightly below the mid-value of the modal 
class, 8.5, and is also less than the value 8.49 which is se- 
cured by weighting within the modal group (using four classes 
on each side). 
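
The relationship may be expressed in a line of Python. The function name mode_from_mean_median is an illustrative assumption; the figures are those quoted above for the telephone pole data.

```python
def mode_from_mean_median(mean, median):
    """Approximate mode for a moderately asymmetrical distribution."""
    return mean - 3 * (mean - median)   # equivalently 3 * median - 2 * mean


print(round(mode_from_mean_median(9.33, 9.015), 3))  # 8.385
```

For a perfectly symmetrical distribution the mean and median coincide and the formula returns that common value.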

It must be emphasized that there is a fictitious accuracy 
to all these values for the mode. All the methods of locat- 
ing the mode which have been discussed are merely approx- 
imative, a fact not to be forgotten in interpreting and uti- 
lizing the results. 

Graphic Location of Mode, Median, Quartiles, and Deciles

A better understanding of the frequency curve and of 
the cumulative frequency curve may be secured through a 
brief discussion of certain methods of locating graphically 
some of the statistical measures that have been described. 

The value of the mode may be readily determined from
a frequency curve of the usual type, for, by definition, the
mode is the reading on the horizontal scale corresponding
to the maximum ordinate of such a curve. If this reading
be taken from the frequency polygon a rough value will be
obtained, the mid-value of the class of greatest frequency.
A closer approximation to the true value of the mode will
be secured from a curve which has been smoothed, either
by inspection or by mathematical methods. Figure 47,
showing a curve (smoothed by inspection) based upon the
wage data presented in Table 8, indicates how the mode
may be located graphically. The horizontal reading corresponding
to the maximum ordinate of this curve is $27.60,
an approximate value of the mode which may be compared
with the values of $27.69 secured by the weighting process
and of $27.3470 secured from the values of the mean and
median.

The locations of the median and mean have been indicated
on this chart. It has been pointed out that in moderately
asymmetrical (or skewed) distributions there tend ...

... interval depends upon the number of cases added within the corresponding
interval on the horizontal scale. Thus the curve
rises gradually at first, then more steeply, and tails off 
gradually at the upper extremity. The value of the mode, 
obviously, is the reading on the horizontal scale corresponding
to the point of greatest steepness. This is the point at
which the increase of frequencies is greatest, the point of
greatest concentration in the frequency distribution. The
value of the mode may be approximated from a smoothed
frequency curve by locating the point at which the slope is
greatest (which is a point of inflection) and taking the corresponding
reading on the x-scale. In the present case a value of
approximately $27.50 is secured for the mode by this method.

Illustrating the Graphic Location of Median and Quartiles

Table 28

Cumulative Distribution of Wage-Earners in a Manufacturing
Establishment

(Classified on the basis of weekly earnings)

                        Number of wage-earners
Weekly earnings         (cumulative frequency)

Less than $22.50                   0
Less than  23.00                   1
Less than  23.50                   5
Less than  24.00                   8
Less than  24.50                  19
Less than  25.00                  29
Less than  25.50                  41
Less than  26.00                  56
Less than  26.50                  78
Less than  27.00                  98
Less than  27.50                 122
Less than  28.00                 152
Less than  28.50                 169
Less than  29.00                 186
Less than  29.50                 193
Less than  30.00                 199
Less than  30.50                 204
Less than  31.00                 208
Less than  31.50                 209
Less than  32.00                 209
Less than  32.50                 210

Values for the median, quartiles, and deciles may also be
secured graphically from the cumulative frequency curve.
The smoothing of such a curve provides a quite satisfactory
method of interpolation and, if the scale of the diagram
is sufficiently large, accurate values may be obtained by
this method. Locate on the vertical scale (the scale of
cumulative frequencies) a point distant from the base by
N/2. If from this point a horizontal line be extended to the
cumulative curve, the abscissa of the point of intersection will be
the value of the median. This value may be easily determined
by dropping a vertical line from the point of intersection
to the x-axis. Figure 48 illustrates the application
of this method. A value of $27.125 is secured for the median
by this method. By direct interpolation a value of $27.1458
is obtained. The quartiles may be located in precisely the
same way, the vertical scale being divided into quarters
and horizontal lines extended to the cumulative curve from
the points thus located on the vertical scale.
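
The reading of values from the cumulative curve may be imitated numerically. The Python sketch below joins the points of Table 28 by straight lines (no smoothing, which is an assumption of the sketch and not the graphic method itself); the function names value_at_cumulative and steepest_interval are illustrative.

```python
def value_at_cumulative(limits, cumulative, target):
    """Abscissa at which the straight-line cumulative curve reaches `target`."""
    for i in range(1, len(limits)):
        if cumulative[i] >= target:
            lo, hi = limits[i - 1], limits[i]
            below, above = cumulative[i - 1], cumulative[i]
            return lo + (target - below) / (above - below) * (hi - lo)
    raise ValueError("target exceeds the total frequency")


def steepest_interval(limits, cumulative):
    """Interval in which the cumulative curve rises most (greatest concentration)."""
    rises = [cumulative[i] - cumulative[i - 1] for i in range(1, len(limits))]
    i = rises.index(max(rises)) + 1
    return limits[i - 1], limits[i]


limits = [22.50 + 0.50 * k for k in range(21)]   # $22.50, $23.00, ..., $32.50
cumulative = [0, 1, 5, 8, 19, 29, 41, 56, 78, 98, 122,
              152, 169, 186, 193, 199, 204, 208, 209, 209, 210]
n = cumulative[-1]

median = value_at_cumulative(limits, cumulative, n / 2)
q1 = value_at_cumulative(limits, cumulative, n / 4)
q3 = value_at_cumulative(limits, cumulative, 3 * n / 4)
print(round(median, 4))                       # 27.1458, the value found by direct interpolation
print(steepest_interval(limits, cumulative))  # (27.5, 28.0)
```

The steepest interval of the unsmoothed curve is only a rough guide to the mode; the value of about $27.50 quoted in the text is read from a smoothed curve.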
For some purposes, particularly those that involve the
averaging of rates or ratios rather than quantities, none of 
the averages which have been described is suitable. The 
geometric and the harmonic means are types of averages 
that should be familiar because they are particularly ap- 
propriate for such purposes. 

The Geometric Mean 

The geometric mean is the nth root of the product of n
measures; its value thus is represented by:

\[ M_g = \sqrt[n]{a_1 \cdot a_2 \cdot a_3 \cdots a_n}. \]

The geometric mean of the numbers 2, 4, 8, is

\[ M_g = \sqrt[3]{2 \times 4 \times 8} = \sqrt[3]{64} = 4. \]

It is obvious from the method of computation that if 
any one of the measures in the series has a value of zero the 
geometric mean is zero. 

The actual computation of the geometric mean is greatly
facilitated by the use of logarithms. In this form

\[ \log M_g = \frac{\log a_1 + \log a_2 + \log a_3 + \cdots + \log a_n}{n}. \]

The logarithm of the geometric mean is equal to the arith- 
metic mean of the logarithms of the individual measures. 
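
The logarithmic form may be sketched in Python. The function name geometric_mean is an illustrative assumption, and the refusal of negative measures is a choice made for this sketch rather than a statement of the text.

```python
import math


def geometric_mean(values):
    """Antilogarithm of the arithmetic mean of the common logarithms."""
    if any(v < 0 for v in values):
        raise ValueError("negative measures are not handled in this sketch")
    if any(v == 0 for v in values):
        return 0.0                      # a single zero makes the product zero
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log


print(round(geometric_mean([2, 4, 8]), 6))  # 4.0, the cube root of 64
print(geometric_mean([2, 0, 8]))            # 0.0
```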

When the measures, of which the geometric mean is desired,
are to be weighted, the separate weights are introduced
as exponents of the terms to which they apply. Thus
if we represent the sum of the weights by N and the weights
corresponding to the terms \( a_1, a_2, \ldots, a_n \), respectively,
by \( w_1, w_2, w_3, \ldots, w_n \), the formula for the geometric mean is

\[ M_g = \sqrt[N]{a_1^{w_1} \cdot a_2^{w_2} \cdot a_3^{w_3} \cdots a_n^{w_n}}. \]

This is equivalent to repeating each term a number of
times, the number corresponding to the amount by which
it is weighted. (This, of course, is precisely what is done
in securing a weighted arithmetic mean.) When logarithms
are employed the formula for the weighted geometric mean
becomes

\[ \log M_g = \frac{w_1 \log a_1 + w_2 \log a_2 + w_3 \log a_3 + \cdots + w_n \log a_n}{N}. \]

A method of computing the geometric mean may be 
illustrated with reference to Table 29, which shows the 
distribution of the prices of 66 preferred stocks paying 
seven per cent dividends. The table is based upon closing 
prices on the New York Stock Exchange and the New York 
Curb Exchange for the week ended July 23, 1936. 

Table 29

Computation of the Geometric Mean of Preferred Stock Prices

Class-interval        m        f        log m        f log m

 $70-$ 89.9          80        5       1.90309       9.51545
  90- 109.9         100       20       2.00000      40.00000
 110- 129.9         120       27       2.07918      56.13786
 130- 149.9         140        6       2.14613      12.87678
 150- 169.9         160        8       2.20412      17.63296

 Total                        66                   136.16305

Log Mg = 136.16305 / 66 = 2.06308

    Mg = $115.6
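
The computation of Table 29 may be checked with a brief Python sketch of the weighted formula, the class frequencies serving as weights. The function name weighted_geometric_mean is an illustrative assumption.

```python
import math


def weighted_geometric_mean(values, weights):
    """Return (log Mg, Mg), where log Mg = sum(w * log a) / sum(w)."""
    total_weight = sum(weights)
    log_mean = sum(w * math.log10(v) for v, w in zip(values, weights)) / total_weight
    return log_mean, 10 ** log_mean


midpoints = [80, 100, 120, 140, 160]
freqs = [5, 20, 27, 6, 8]

log_mg, mg = weighted_geometric_mean(midpoints, freqs)
print(round(log_mg, 5))  # 2.06308
print(round(mg, 1))      # 115.6
```

Logarithms carried to full machine precision differ from the five-place logarithms of the table only beyond the fifth decimal.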


CHARACTERISTICS OF THE GEOMETRIC MEAN

The nature of the geometric mean may be understood 
by considering its relation to the terms it represents, as an 
average. 

If the arithmetic mean of a series of measures replace
each item in the series, the sum of the measures will remain
unchanged. Thus, the sum of the numbers 2, 4, 8 is 14.
The arithmetic mean of these three numbers is 4⅔; if this
value be inserted in the place of each of the three measures
the sum remains 14. It is characteristic of the geometric
mean that the product of a series of measures will remain
unchanged if the geometric mean of those measures replace
each item in the series. Thus the product of 2, 4, 8 is 64.
The geometric mean of the three numbers is 4; if this value
replace each of the three measures the product remains 64.

Again, it is true of the arithmetic mean that the sum of 
the deviations of the items above the mean equals the 
sum of the deviations of the items below the mean (disre- 
garding signs). The sums of the differences between the 
individual items and the mean are equal. In the case of 
the geometric mean the products of the corresponding ratios 
are equal. If the ratios of the geometric mean to the meas- 
ures which it exceeds be multiplied together, the product 
will equal that secured by multiplying together the ratios 
to the geometric mean of the measures exceeding it in value. 
For example, the geometric mean of the numbers 3, 6, 8, 9 
is 6. The following equation may be set up: 


\[ \frac{6}{3} \times \frac{6}{6} = \frac{8}{6} \times \frac{9}{6} \]
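
Both characteristics may be verified numerically for the numbers 3, 6, 8, 9. In the Python sketch below the function name ratio_products is an illustrative assumption; math.prod requires Python 3.8 or later.

```python
import math


def ratio_products(values):
    """Return (Mg, product of Mg/a for a <= Mg, product of a/Mg for a > Mg)."""
    gm = math.prod(values) ** (1 / len(values))
    below = math.prod(gm / v for v in values if v <= gm)  # ratios of the mean to smaller items
    above = math.prod(v / gm for v in values if v > gm)   # ratios of larger items to the mean
    return gm, below, above


gm, below, above = ratio_products([3, 6, 8, 9])
print(round(gm, 6), round(below, 6), round(above, 6))  # 6.0 2.0 2.0
print(round(gm ** 4, 6))  # 1296.0, the product 3 x 6 x 8 x 9 is unchanged
```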


The last example brings out the most important charac- 
teristic of the geometric mean. It is a means of averaging 
ratios. Its chief use in the field of economic statistics has 
been in connection with index numbers of prices, where 
rates of change are of major importance. A rise in prices 
represented by the change from 50 to 100 is as important 
as a rise from 100 to 200. Yet this equivalence is not brought 
out by the arithmetic mean, which gives double weight 
to the change which involves an absolute difference of 100. 
An example frequently cited is that of two cases of price 
change, one a ten-fold increase, from 100 to 1,000, the other 
a fall to one tenth of the old price, from 100 to 10. The 
arithmetic mean of 1,000 and 10 is 505, the geometric 
mean is \( \sqrt{1{,}000 \times 10} \), or 100. When the average is of the 
latter type it is seen that the two equal ratios of change 
have balanced each other. The arithmetic mean, 505, is 
quite incorrect as a measure of average ratio of price change. 
This subject is discussed at greater length in the chapter on 
index numbers. 
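The example of the ten-fold rise and the fall to one tenth can be worked through as a short sketch. The variable names are illustrative only; the figures are those quoted in the text.

```python
import math

old_price = 100
new_prices = [1000, 10]   # a ten-fold rise and a fall to one tenth

arithmetic = sum(new_prices) / len(new_prices)
geometric = math.sqrt(new_prices[0] * new_prices[1])

# Expressed as ratios to the old price: the arithmetic mean suggests a
# five-fold average rise, while the geometric mean shows that the two
# equal and opposite ratios of change have balanced each other.
arithmetic_ratio = arithmetic / old_price
geometric_ratio = geometric / old_price

print(arithmetic, geometric, arithmetic_ratio, geometric_ratio)
```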

What has been said in an earlier section in regard to the 
advantages of logarithmic charting for certain purposes 
bears upon the use of the geometric mean. This average 
is sometimes called the logarithmic mean, as its logarithm 
is simply the arithmetic mean of the logarithms of the 
constituent measures. Wherever percentages of change are 
being averaged, where ratios rather than absolute differ- 
ences are significant, the use of the geometric mean is 
advisable. 

A problem involving the use of the geometric mean arises 
in computing the average rate of increase of any sum at 
compound interest. If \( P_0 \) represent the principal at the 
beginning of the period, \( P_n \) the principal at the end of the 
period, r the rate of interest and n the number of years 
in the period, the sum to which \( P_0 \) will amount at the end 
of the n years, if interest is compounded annually, is represented 
by the equation: 

\[ P_n = P_0 (1 + r)^n. \]

It follows from this that: 

\[ r = \sqrt[n]{\frac{P_n}{P_0}} - 1. \]

Thus, if $1,000 at compound interest amounts to $1,600 
at the end of 12 years, there has been an increase of 60 per 
cent. The arithmetic mean (60 per cent divided by 12 years) 
is 5 per cent, but this is not the rate at which the money 
increased. The true rate is: 

\[ r = \sqrt[12]{\frac{1{,}600}{1{,}000}} - 1 = \sqrt[12]{1.60} - 1 = 1.04 - 1 = .04, \text{ or } 4\% \]

Precisely the same problem arises whenever rates of increase 
or decrease are to be averaged. The use of the arithmetic 
mean gives an incorrect result. 
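The compound-interest case may be sketched as follows. The function name `average_growth_rate` is an assumption made for illustration; it simply solves the compound-interest relation for the rate r, given the principal at the beginning and end of the period.

```python
def average_growth_rate(p0, pn, n):
    """Annual rate r such that p0 * (1 + r) ** n equals pn."""
    return (pn / p0) ** (1.0 / n) - 1.0

p0, pn, years = 1000, 1600, 12

true_rate = average_growth_rate(p0, pn, years)

# The arithmetic mean spreads the total 60 per cent increase evenly
# over the 12 years and so overstates the rate.
naive_rate = (pn / p0 - 1.0) / years

print(round(true_rate, 4), round(naive_rate, 4))

# Check: compounding at the true rate reproduces the final principal.
print(round(p0 * (1 + true_rate) ** years, 2))
```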

THE GEOMETRIC MEAN AS A MEASURE OF CENTRAL 
TENDENCY 

A question arises as to the type of frequency distribution 
the central tendency of which would be best represented 
by the geometric mean. When the absolute measures, 
plotted on the arithmetic scale, give a fairly symmetrical 
distribution, the arithmetic mean is clearly preferable to the 
geometric mean. But when the absolute figures thus plotted 
give an asymmetrical frequency curve of such a type that 
the asymmetry would be removed and a symmetrical curve 
secured by plotting the logarithms of the measures, the 
geometric mean would appear to be preferable. Such a 
distribution would be one in which not the absolute devia- 
tions about the central tendency but the relative deviations, 
the deviations as ratios, were symmetrical. The arithmetic 
mean of the logarithms of the various measures (which 
value is, as has been shown, the logarithm of the geometric 
mean of the original measures) would be the best representa- 
tive of the central tendency in such a distribution. The 
curve thus plotted would be symmetrical about the logarithm 
of the geometric mean. A frequency curve representing 
the logarithms of percentage changes in prices would tend 
to show this symmetry about the logarithm of the geometric 
mean of these changes. These percentage changes, as nat- 
ural numbers, group themselves in an asymmetrical form, 
with the range of deviations above the arithmetic mean 
greatly exceeding the range below.¹ This arises, of course, 
from the fact that prices of given commodities may increase 
1,000 per cent or more from a given base, but cannot fall 
more than 100 per cent from any given base. The section 
on index numbers contains a fuller discussion of this particular 
phase of the subject.² 

¹ Cf. Fig. 51. 

The construction of a frequency distribution in which loga- 
rithms are tabulated would be laborious, if the logarithm 
of each item to be entered had to be determined, before 
tabulation. It is possible, however, with no great trouble to 
construct a true logarithmic distribution, with class-interval 
constant in terms of logarithms. The 66 quotations on preferred 
stocks, tabulated in Table 29, range from 74 to 166. 
The logarithm of 74 is 1.86923; the logarithm of 166 is 
2.22011. The range, in logarithms, is .35088. We may 
select .06 as a suitable logarithmic class-interval, for the 
present purpose. For convenience in tabulating the data 
we set up two series of class limits, one in terms of logarithms, 
one in terms of the corresponding natural numbers. In 
constructing the distribution natural numbers may be tab- 
ulated, utilizing the class limits defined in natural terms. 
All subsequent calculations may be carried through in terms 
of logarithms. The distribution appears in Table 30 on 
page 131. 
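The two parallel series of class limits can be generated mechanically. The sketch below assumes the lowest logarithmic limit of 1.85 shown in Table 30 and the interval of .06 chosen in the text; the helper `class_index` is hypothetical and shows how a price quoted in natural numbers may be assigned to its logarithmic class.

```python
import math

LOG_START = 1.85   # lowest logarithmic class limit (taken from Table 30)
LOG_STEP = 0.06    # logarithmic class-interval selected in the text

# Lower class limits in logarithms, and the corresponding natural numbers.
log_limits = [round(LOG_START + LOG_STEP * k, 2) for k in range(8)]
natural_limits = [10 ** x for x in log_limits]

def class_index(price):
    """0-based index of the logarithmic class in which a price falls."""
    return math.floor((math.log10(price) - LOG_START) / LOG_STEP)

for log_limit, natural_limit in zip(log_limits, natural_limits):
    print(f"{log_limit:.2f}  {natural_limit:8.2f}")

# The smallest and largest quotations fall in the first and seventh classes.
print(class_index(74), class_index(166))
```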

If the geometric mean is considered appropriate for a given 
series, the type of distribution represented by Table 30 is 
more logical than that shown in Table 29, and the descriptive 
measurements secured from Table 30 have correspondingly 
greater validity. We may derive the mean of the 

² C. M. Walsh, in The Problem of Estimation (London, P. S. King & Son, 
1921), 35, lays down the following criteria for the use of averages: 

(a) When there are no conceivable or assignable upper or lower limits to the 
values of the terms in a series, the arithmetic average should be employed. 

(b) When there is a definite lower limit at or above zero and no upper conceivable 
or assignable limit, the geometric average should be employed. 
Because this is true of price changes Walsh believes the geometric 
average to be the correct one to use in making index numbers of prices. 

(c) When in practice, or in the nature of things, certain upper and lower limits 
are found to exist and the above criteria cannot be employed, a study of 
the actual dispersion of the data is necessary. In this case, if the mode is 
found nearer to the arithmetic average, that average should be employed; 
if the mode is found nearer to the geometric average, that average 
should be used. 

Table 30 

Distribution of Prices of Preferred Stocks 
Paying Seven Per Cent Dividends 

Class-interval        Class-interval     Mid-point       Frequency 
(natural numbers)     (logarithms)       (logarithms) 
                                              m               f          fm 

$ 70.80- 81.27        1.85-1.9099           1.88              2         3.76 
  81.28- 93.32        1.91-1.9699           1.94              4         7.76 
  93.33-107.15        1.97-2.0299           2.00             12        24.00 
 107.16-123.02        2.03-2.0899           2.06             30        61.80 
 123.03-141.24        2.09-2.1499           2.12              6        12.72 
 141.25-162.17        2.15-2.2099           2.18              7        15.26 
 162.18-186.20        2.21-2.2699           2.24              5        11.20 

                                                             66       136.50 


logarithms of the preferred stock prices by dividing Σfm 
of Table 30 (136.50) by 66. The value is 2.06818. The 
anti-log of this is 116.97, which is the geometric mean of 
the distribution. This differs somewhat from the value 
$115.63 secured from Table 29. The difference is due, in 
part, to the use of different class-intervals and class limits 
in the two cases. With a relatively small number of observa- 
tions such differences would be expected to lead to different 
results. Differing assumptions concerning the internal dis- 
tribution of items within the several classes would also 
contribute to a discrepancy between the two results. The 
value obtained from Table 30 is probably a closer approxima- 
tion to the actual geometric mean than that obtained from 
Table 29. 

A frequency curve based upon the logarithms of the 
measures included rather than upon the natural numbers, 
has been employed to advantage in plotting data relating 
to income distribution. When natural numbers are plotted, 
the range of income distribution is so large that it is physi- 
cally impossible to prepare a chart that will reveal the char- 
acteristic features of all sections of the curve. The process 
of plotting on double logarithmic paper (which is, of course, 
equivalent to plotting the logarithms of both x’s and y’s) 
meets this difficulty, giving a true impression of the whole 
distribution and the relations between its parts, and, at the 
same time, brings out certain important features that are 
obscured in the natural scale chart. In particular, this 
device appears to smooth into a straight line that part of 
the curve lying above the mode, a fact which led Vilfredo 
Pareto to enunciate what has been known as Pareto's Law 
concerning income distribution. An intensive study of the 
distribution of income in the United States has led the staff 
of the National Bureau of Economic Research to call into 
question certain conclusions drawn from Pareto’s generaliza- 
tions, though the value of the double logarithmic scale for 
the presentation of income data has been recognized. 

The Harmonic Mean 

The harmonic mean is a type of average capable of 
application only within a restricted field, but which should 
be employed to avoid error in handling certain types of 
data. It must be used in the averaging of time rates and 
it has distinctive advantages in the manipulation of some 
types of price data. The following example will illustrate 
the method of employing this average. 

A given commodity is priced, in three different stores, at 
“four for a dollar,” “five for a dollar” and “twenty for a 
dollar.” The average price per unit is required. The arithmetic 
average of the figures given (4, 5, and 20) is 9 2/3. If 
we take this to be the average number sold per dollar, the 
average price would appear to be $1.00 ÷ 9 2/3, or about 10 1/3 cents 
each. But the original quotations are equivalent to unit 
prices of 25 cents, 20 cents, and 5 cents; the arithmetic 
average of these prices is 16 2/3 cents apiece. The discrepancy 
between 10 1/3 cents and 16 2/3 cents is due to a faulty use 
of the arithmetic mean in averaging quotations in the “so 
many per dollar” form. Such a mean is, in effect, a weighted 
average, with greater weight being given to quotations 
involving a larger number of commodity units. 

The correct result may be secured by taking the harmonic 
mean of the three original quotations. The harmonic mean 
of a series of numbers is the reciprocal of the arithmetic mean 
of the reciprocals of the individual numbers. Thus if we represent 
the numbers to be averaged by \( r_1, r_2, \ldots, r_N \), the formula 
for the harmonic mean, H, is 

\[ \frac{1}{H} = \frac{\dfrac{1}{r_1} + \dfrac{1}{r_2} + \dfrac{1}{r_3} + \cdots + \dfrac{1}{r_N}}{N} \]

Using the figures just quoted: 

\[ \frac{1}{H} = \frac{\dfrac{1}{4} + \dfrac{1}{5} + \dfrac{1}{20}}{3} = \frac{30}{60} \times \frac{1}{3} = \frac{1}{6} \]

\[ H = 6. \]

The harmonic mean of 4, 5, and 20 is 6, the average number 
of units sold per dollar. The average price per unit is 
16 2/3 cents. 
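The store-price example can be reproduced exactly with rational arithmetic. This is a minimal sketch; the function name and the use of Python's `fractions` module are choices made here for illustration, so that the thirds appear without rounding.

```python
from fractions import Fraction

def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    values = [Fraction(v) for v in values]
    return len(values) / sum(1 / v for v in values)

units_per_dollar = [4, 5, 20]

h = harmonic_mean(units_per_dollar)          # average units per dollar
price_from_harmonic = 100 / h                # cents per unit

# Faulty procedure: arithmetic mean of the "so many per dollar" figures.
arithmetic = Fraction(sum(units_per_dollar), len(units_per_dollar))
price_from_arithmetic = 100 / arithmetic     # cents per unit

# Direct check: arithmetic mean of the three unit prices in cents.
unit_prices = [Fraction(100, u) for u in units_per_dollar]
mean_unit_price = sum(unit_prices) / len(unit_prices)

print(h, price_from_harmonic, mean_unit_price, float(price_from_arithmetic))
```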

The computation of the harmonic mean of a series of 
magnitudes is greatly facilitated by the use of prepared 
tables of reciprocals.¹ 

¹ Barlow’s Tables of Squares, Cubes, Square Roots, Cube Roots and Reciprocals, 
New York, Spon and Chamberlain. 

Relations between Different Averages 

When different averages are located or computed for a 
given series of magnitudes, certain relationships between 
them are found to prevail. 

1. The arithmetic mean, the median and the mode coincide in 
a symmetrical distribution. 

2. In a moderately asymmetrical distribution the median lies 
between the mean and the mode, approximately one third 
of the distance along the scale from the former towards the 
latter. Hence, for this type of distribution there is an approximation 
to the following relationship: 

\[ Mo - M = 3(Md - M) \]

3. The arithmetic mean of any series of magnitudes is greater 
than their geometric mean. 

4. The geometric mean of any series of magnitudes is greater 
than their harmonic mean. The only exception to the last 
two rules is found when all the measures in the series are 
equal, in which case arithmetic mean, geometric mean and 
harmonic mean are equal. 

5. The geometric mean of any two terms is equal to the geometric 
mean of the harmonic and arithmetic means of those terms. 
Thus if the terms be 2 and 8, the harmonic mean is 3 1/5, the 
geometric mean 4, and the arithmetic mean 5. But 4 is also 
the geometric mean of 3 1/5 and 5. This relationship does not 
hold when the series includes more than two terms, unless 
the terms constitute a geometric series. 

6. When the dispersion of data follows the arithmetic law, the 

mode and median will generally be found closer to the 
arithmetic than to the geometric average. When the dispersion 
follows the geometric law the mode and median will 
generally be found closer to the geometric than to the arithmetic 
average. 
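Relations 3, 4 and 5 in the list above lend themselves to a numerical check. The sketch below uses the pair 2 and 8 from the text; the three-term series 2, 4, 9 is an assumed example, chosen here only because it is not a geometric series, and the function names are illustrative.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harmonic_mean(xs):
    return len(xs) / sum(1 / x for x in xs)

def three_means(xs):
    return arithmetic_mean(xs), geometric_mean(xs), harmonic_mean(xs)

# Two terms: the geometric mean equals the geometric mean of the
# harmonic and arithmetic means (relation 5).
a2, g2, h2 = three_means([2, 8])
relation_holds_for_pair = math.isclose(g2, math.sqrt(a2 * h2))

# Three terms not in geometric series: the ordering of relations 3 and 4
# still holds, but relation 5 does not.
a3, g3, h3 = three_means([2, 4, 9])
relation_holds_for_triple = math.isclose(g3, math.sqrt(a3 * h3), rel_tol=1e-3)

print(a2, round(g2, 6), h2, relation_holds_for_pair)
print(a3, round(g3, 4), round(h3, 4), relation_holds_for_triple)
```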

Characteristic Features of the Chief Averages 
The arithmetic mean 

1. The value of the arithmetic mean is affected by every measure 

in the series. For certain purposes it is too much affected by 
extreme deviations from the average. 

2. The arithmetic mean is easily calculated, and is determinate 

in every case. 

3. The arithmetic mean is a computed average, and hence is 

capable of algebraic manipulation. 

The median 

1. The value of the median is not affected by the magnitude of 

extreme deviations from the average. 

2. The median may be located when the items in a series are not 

capable of quantitative measurement. 

3. The median may be located when the data are incomplete, 
provided that the number and general location of all the 
cases be known, and that accurate information be available 
concerning the measures near the center of the distribution. 

4. The median is not as well adapted to algebraic manipulation 
as the arithmetic, geometric and harmonic means. 

The mode 

1. The value of the mode is not affected by the magnitude of 

extreme deviations from the average. 

2. The approximate mode is easy to locate but the determination 

of the true mode requires extended calculation. 

3. The mode has no significance unless the distribution includes a 

large number of measures and possesses a distinct central 
tendency. 

4. The mode is the average most typical of the distribution, 

being located at the point of greatest concentration. 

5. The mode is not capable of algebraic manipulation. 

The geometric mean 

1. The geometric mean gives less weight to extreme deviations 

than does the arithmetic mean. 

2. It is strictly determinate in averaging positive values. 

3. The geometric mean is the form of average to be used when rates 

of change or ratios between measures are to be averaged, 
as equal weight is given to equal ratios of change. It is par- 
ticularly well adapted to the averaging of ratios of price 
change. 

4. The geometric mean is capable of algebraic manipulation. 

The harmonic mean 

1. The harmonic mean is adapted to the averaging of time rates 

and certain similar terms. It has been employed in the 
field of economic statistics in the manipulation of price data. 

2. The labor of computing the harmonic mean and its unfamiliarity 

detract from its usefulness in ordinary statistical analysis. 

3. The harmonic mean is capable of algebraic manipulation. 

This summary has been designed to show that each 
type of average has its own particular field of usefulness. 
Each one is best for certain purposes and under certain 
conditions. The characteristics and limitations of each one 
should be understood in order that it may be appropriately 
employed. A complete description of a frequency distribution 
frequently calls for the determination of two or three 
of the chief averages, as well as other statistical measurements. 
The arithmetic mean is perhaps the most useful 
single average. The simplicity of its computation, the 
possibility of employing it in algebraic calculations and 
the fact that its meaning is perfectly definite and familiar 
make it highly serviceable in statistical work. Its sphere 
of usefulness is not universal, however, and it should only 
be employed when the given conditions render it suitable. 
A fuller appreciation of the distinctive virtues of the geometric 
mean is leading to a wider employment of that 
measure in many types of statistical work. A discriminating 
use of averages is essential to sound statistical analysis. 

CHAPTER V 


DESCRIPTION OF THE FREQUENCY 
DISTRIBUTION: MEASURES OF 
VARIATION AND SKEWNESS 

In the preceding chapters we have been concerned, first, 
with methods of reducing a mass of quantitative data to a 
form in which the characteristics of the mass as a whole 
may be readily determined and, in the second place, with 
methods of describing the assembled data. The first ob- 
ject is accomplished with the formation of a frequency 
distribution. The second is partially accomplished when 
there has been obtained a single significant value in the 
form of an average which represents the central tendency 
of the distribution. But any average, by itself, fails to give 
a complete description of a frequency distribution. Three 
other values are needed before the chief characteristics of a 
given distribution have been measured, and comparison 
with other distributions is possible. The first of these is a 
measure of the degree to which the items included in the 
original distribution depart or vary from the central value, 
the degree of “scatter,” variation or dispersion. The second 
is a measure of the degree of symmetry of the distribution, 
of the balance or lack of balance on the two sides of the 
central value. The third is a measure of kurtosis, of the de- 
gree to which there is a bunching of cases at the modal 
value. The present chapter deals with various measures of 
variation and skewness. The method of measuring kurtosis 
is referred to at a later point. 

Nature and Significance of Variation 

The fact of variation in collections of quantitative data 
has been pointed out in earlier sections and the bearing of 
this fact upon the work of the statistician indicated. Practically 
every collection of quantitative data, consisting of 
measurements from the social, biological, or economic field, 
is characterized by variation, by quantitative differences 
among the individual units. And this fact of variation is 
as important as the fact of family resemblance. Biological 
variation has been a fundamental factor in the evolutionary 
process. No measurement of a physical characteristic of a 
racial group, such as height, is complete without an accompanying 
measure of the average variation in the group 
in this respect. The average income in a country is perhaps 
of less significance than the variation in income, the differences 
between the incomes received by different economic 
classes. Price variations interrupt the normal functioning 
of the economic system, causing hardship to some and 
. . . because the various elements . . . of prices but differences among 
. . . commodities and services . . . unless the . . . frequency distribution is 
known. If the variation is so great that there is no pronounced 
central tendency an average has no significance. With . . . 
becomes increasingly significant . . . distribution. . . . a measure of 
. . . supplemented by a measure of . . . 

Measures of Absolute Variation 

. . . units are employed absolute variability is measured; when 
an abstract figure is secured we have a measure of relative 
variability, more suitable for comparison than the former 
type. Measures of absolute variability are first considered. 

THE RANGE 

A rough measure of variation is afforded by the range, 
which is the absolute difference between the value of the 
smallest item and the value of the greatest item included 
in the distribution. Table 20 in Chapter IV shows the dis- 
tribution of London-New York monthly exchange rates 
during the period 1882-1913. The smallest item among the 
original figures included in the table is $4.83; the greatest 
is $4.908. The range, therefore, is $4.908 − $4.83, or $.078. 
A distance on the scale equal to $.078 will include every 
item. If the original data were not to be had the range 
could be approximated from the frequency table. It would 
be the difference between the lower limit of the class at the 
lower extreme of the distribution, and the upper limit of 
the class at the upper extreme, or $.085 in the present 
case. 

The value of the range, it is obvious, depends upon the 
values of the two extreme cases only. A single abnormal 
item would change its value materially. Because it is 
erratic and is likely to be unrepresentative of the true 
distribution of items, it is seldom used in statistical work. 
The range is frequently employed as a measure of stock 
market fluctuations, though its adequacy for this purpose 
may be questioned. 

THE MEAN DEVIATION 

A more accurate measure of the dispersion of items about 
a central value is afforded by the simple device of measuring 
the deviation of each item from this central value and averaging 
these deviations. The simple example in Table 31 
illustrates the method of computation: 

Table 31 

Computation of Mean Deviation 

  m       f       d 

  3       1       6 
  6       1       3 
  9       1       0 
 12       1       3 
 15       1       6 

          5      18 

M = 9 
M.D. = 18/5 = 3.6 


The average (the mean and median coincide in this case) 
is 9. The deviations are added, taking no account of alge- 
braic signs, and the total divided by the number of items. 
This procedure is described by the expression 

\[ M.D. = \frac{\sum |d|}{N} \]


In general terms, the mean deviation of a series of mag- 
nitudes is the arithmetic mean of their deviations from an 
average value (either mean or median). In the process of 
summation and averaging the algebraic signs of the devi- 
ations are disregarded. In practice it makes little differ- 
ence whether deviations be measured from the mean or 
the median. Theoretically the latter should be chosen, for 
the value of the mean deviation is least when the median 
is the point of reference. 
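The computation of Table 31, and the statement that the mean deviation is least when measured from the median, may be illustrated with a brief sketch. The second series (1, 2, 3, 4, 20) is an assumed example, introduced here only because its mean and median differ; the function name is likewise illustrative.

```python
import statistics

def mean_deviation(values, center):
    """Arithmetic mean of the deviations from `center`, signs disregarded."""
    return sum(abs(v - center) for v in values) / len(values)

# The five items of Table 31; mean and median coincide at 9.
table_31 = [3, 6, 9, 12, 15]
md_table_31 = mean_deviation(table_31, statistics.median(table_31))

# An asymmetrical series, for which mean and median differ.
skewed = [1, 2, 3, 4, 20]
md_from_median = mean_deviation(skewed, statistics.median(skewed))
md_from_mean = mean_deviation(skewed, statistics.mean(skewed))

print(md_table_31, md_from_median, md_from_mean)
```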

Table 32 illustrates the computation of the mean deviation 
when the data are grouped in a frequency distribution.¹ 
In this work, as in certain other computations, we 
make the assumption that the items in each class-interval 
are uniformly distributed throughout that interval. 

The median hourly wage of the 4,216 steel workers represented 
in this distribution is 48.11 cents. The mean deviation 
could be computed directly, with reference to deviations 
from the median, but it is simpler to measure the deviations 
from the midpoint of the class containing the median, and 
then apply corrections to offset the resulting error. 

¹ Since the uses of the mean deviation are somewhat limited, the beginning 
student may well omit the remainder of the section on the mean deviation. 
After study of the more widely employed standard deviation the student may 
wish to return to the computation of the mean deviation of observations 
grouped in a frequency distribution. 

Table 32. Computation of Mean Deviation 
Average Hourly Earnings of Workers in Open-Hearth Furnaces 
in the Great Lakes and Middle West District in 1933 

Class-interval     Mid-point    Frequency    Deviation from 
(in cents per                                arbitrary origin 
hour)                  m            f              d'            fd' 

 25.0- 29.9           27.5          12             20            240 
 30.0- 34.9           32.5         472             15          7,080 
 35.0- 39.9           37.5         700             10          7,000 
 40.0- 44.9           42.5         601              5          3,005 
 45.0- 49.9           47.5         520              0              0 
 50.0- 54.9           52.5         537              5          2,685 
 55.0- 59.9           57.5         397             10          3,970 
 60.0- 64.9           62.5         225             15          3,375 
 65.0- 69.9           67.5         139             20          2,780 
 70.0- 74.9           72.5         111             25          2,775 
 75.0- 79.9           77.5          43             30          1,290 
 80.0- 84.9           82.5         111             35          3,885 
 85.0- 89.9           87.5          74             40          2,960 
 90.0- 94.9           92.5          59             45          2,655 
 95.0- 99.9           97.5          45             50          2,250 
100.0-104.9          102.5          51             55          2,805 
105.0-109.9          107.5          78             60          4,680 
110.0-114.9          112.5           6             65            390 
115.0-119.9          117.5          17             70          1,190 
120.0-124.9          122.5           1             75             75 
125.0-129.9          127.5           2             80            160 
130.0-134.9          132.5           5             85            425 
135.0-139.9          137.5           7             90            630 
140.0-144.9          142.5           1             95             95 
145.0-149.9          147.5           1            100            100 
150.0-154.9          152.5           0            105              0 
155.0-159.9          157.5           1            110            110 

                                 4,216                        56,610 

Median = 48.11 
Arbitrary origin = 47.5 
c = 48.11 − 47.5 = 0.61 
Na = No. of observations in classes above that containing the median = 1,911 
Nb = No. of observations in classes below that containing the median = 1,785 
Nm = No. of observations in the class-interval containing the median = 520 
i = 5 

Computation of median: 

N/2 = 2,108 
Md = 45.0 + (323/520 × 5.0) = 45.0 + 3.11 = 48.11 

Calculations 

(1) Sum of deviations from arbitrary origin of all observations in classes 
other than that containing the median = 56,610 

(2) \( (N_b - N_a)c = -76.86 \) 

(3) \( \frac{N_m}{2i}\left[\left(\frac{i}{2} + c\right)^2 + \left(\frac{i}{2} - c\right)^2\right] = 688.67 \) 

Sum of deviations from median = 56,610 − 76.86 + 688.67 

In Table 32 deviations have been measured not from 
48.11, the value of the median, but from 47.5, the midpoint 
of the class in which the median falls. Working with these 
measurements, the computations involve three steps: 

1. Obtaining the sum of the deviations from the assumed 
median of all items falling in classes other than that con- 
taining the true median. 

2. Correcting this sum for the error involved in the use of 
an origin other than the true median. 

3. Adding to the corrected sum the sum of the deviations 
from the median of the items within the class-interval con- 
taining the median. 

(1) The sum referred to in (1) is obtained directly, in the manner 
indicated in Table 32.¹ It comes to 56,610. 

(2) The four classes below that containing the median contain 
1,785 items. The deviation of each of these items from the 
true median, 48.11, is greater by 0.61 than the deviations actually 
recorded in Table 32, which are measured from 47.5. The 
measured deviations are too small by 0.61 for 1,785 items. The 
22 classes above that containing the median contain 1,911 items. 
For each of these the deviation from the true median, 48.11, is less 
by 0.61 than the deviations actually recorded, which are measured 
from 47.5. The measured deviations are too large by 
0.61 for 1,911 items. Accordingly the figure 56,610, which we 
have obtained as the sum of the deviations from the arbitrary 
origin of all the items in classes other than that containing the 
true median, must be corrected by the addition to it, algebraically, 
of + (1,785 × 0.61) and − (1,911 × 0.61). 

¹ This is not the sum of the deviations from 47.5, the arbitrary origin. For 
no account is taken of the deviations from that value of the 520 items falling 
within the class in question. If these are scattered uniformly throughout the 
class-interval they will contribute to the total of the deviations from 47.5. 
This would not be so if we were working on the assumption that all the items 
in a class are concentrated at the midpoint. In computing the mean deviation, 
however, it is necessary to make a different assumption, namely, that of 
uniform distribution throughout the class-interval. 

The corrections under point (2) may be defined more briefly.² 
Let Na = number of items in classes above that containing the 
median, Nb = number of items in classes below that containing 
the median, and c = Md − O, where Md is the median and 
O is the arbitrary origin. The quantity c will, of course, be 
positive or negative, depending on the relative values of Md 
and O, and this sign should be retained throughout the calculations. 
The correction noted in (2) is then given by 

\[ (N_b - N_a)c \]

which is to be added (algebraically) to the sum referred to in 
(1). In the present instance we have, as the required correction, 

(1,785 − 1,911) × (+ 0.61) = − 76.86. 

(3) Taking account of point (3) now, we must measure the 
deviations from the median of the 520 observations hitherto 
neglected. These are the observations falling within the class-interval 
that contains the median. This class-interval extends, 
on the x-scale, from 45.0 to 50.0. The value of the median is 
48.11. If the 520 observations are uniformly distributed between 
45.0 and 50.0, the number falling between 45.0 and 48.11 
may be computed by the direct proportion 

\[ \frac{3.11}{5.0} \times 520 = 323.4. \]

Similarly, for the number of observations between 48.11 and 
50.0, we have 

\[ \frac{1.89}{5.0} \times 520 = 196.6. \]

On the assumption of uniform distribution, the average deviation 
from the median of the 323.4 observations falling between 45.0 
and 48.11 is 1.555 (i.e., 3.11/2). For the sum of the deviations 
of this group from the median, we have 

323.4 × 1.555 = 502.887. 

Similarly, the average deviation from the median of the 196.6 
observations falling between 48.11 and 50.0 is .945 (i.e., 1.89/2). 
For the sum of the deviations of this group from the median 
we have 

196.6 × .945 = 185.787. 

The sum of the deviations from the median of all the observations 
in the class containing the median is 

502.887 + 185.787 = 688.674. 

² Cf. A Handbook of Mathematical Statistics, H. L. Rietz, editor, Boston, 
Houghton Mifflin, 1924, 30. 

In more general terms, the correction noted in (3) may be 
defined as follows. We have c = Md − O; let i = class-interval, 
and let Nm = number of observations in the class-interval 
in which the median lies. The sign of c must be retained in the 
calculations. For the number of items in that portion of this 
class-interval which falls below the median, we have 

\[ N_m \, \frac{\frac{i}{2} + c}{i} \]

The average deviation of these items from the median is 

\[ \frac{\frac{i}{2} + c}{2} \]

The sum of the deviations from the median of the items in this 
segment of the class-interval containing the median is the 
product of these two quantities, or 

\[ \frac{N_m}{2i}\left(\frac{i}{2} + c\right)^2 \]

For the number of items in that portion of this class-interval 
which lies above the median, we have 

\[ N_m \, \frac{\frac{i}{2} - c}{i} \]

The average deviation of these items from the median is 

\[ \frac{\frac{i}{2} - c}{2} \]

The sum of the deviations from the median of the items in this 
segment of the class-interval containing the median is given by 

\[ \frac{N_m}{2i}\left(\frac{i}{2} - c\right)^2 \]

Accordingly, the total correction referred to under (3), on p. 142, 
or the sum of the deviations from the median of the items within 
the class-interval containing the median, is 

\[ \frac{N_m}{2i}\left[\left(\frac{i}{2} + c\right)^2 + \left(\frac{i}{2} - c\right)^2\right] \]

The nature of these formulas may be made clearer by insertion 
of the values in the example cited above. 

In the final computation of the mean deviation we must 
apply to the sum referred to under (1), on p. 142, the two corrections 
noted under (2) and (3) on p. 142. From (1) we have 
56,610; the correction under (2) is − 76.86; the correction 
under (3) is + 688.67. The sum of the deviations from the 
median is, therefore, 57,221.81. For the mean deviation from 
the median, we have 

\[ M.D. = \frac{57{,}221.81}{4{,}216} = 13.573. \]

The mean deviation from the mean may be computed by 
an identical process. 

THE STANDARD DEVIATION 

The process of calculating the mean deviation is alge- 
braically illogical because algebraic signs are disregarded. 
In the computation of the standard deviation this error is 
avoided and a measure of more precise mathematical
significance is secured. The conventional symbol for the
standard deviation is the Greek letter sigma, σ.

In computing this measure the deviations of the individual
items from the arithmetic mean are squared,
the mean of the squared deviations obtained, and the square
root of this mean extracted. The standard deviation is,
thus, the square root of the mean of the squared deviations.
This measure is also termed the root-mean-square deviation, 
a useful name because it describes in full the method of 
calculation. The deviations are always measured from the 
arithmetic mean, as the value of the measure is a minimum 
under these conditions. A simple example will illustrate the 
process (Table 33). 


Table 33

Computation of Standard Deviation

    m      f       d      d²
    3      1     - 6      36
    6      1     - 3       9
    9      1       0       0
   12      1     + 3       9
   15      1     + 6      36
         ---             ---
           5              90

M = 9

$$ \sigma = \sqrt{\frac{\sum d^2}{N}} = \sqrt{\frac{90}{5}} = \sqrt{18} = 4.24 $$


When the standard deviation is computed from ungrouped
data, as in this case, the formula ¹ is

$$ \sigma = \sqrt{\frac{\sum d^2}{N}}. $$
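The calculation of Table 33 may be sketched in a few lines of Python. This is a minimal illustration, not a prescribed routine; the function name and the `divisor_offset` parameter are hypothetical. Setting the offset to 1 gives the N - 1 divisor sometimes preferred when a sample is used to estimate the standard deviation of a population.

```python
import math

def standard_deviation(items, divisor_offset=0):
    """Root-mean-square deviation from the arithmetic mean.
    divisor_offset=0 divides by N; divisor_offset=1 divides by N - 1."""
    n = len(items)
    mean = sum(items) / n
    sum_sq = sum((x - mean) ** 2 for x in items)
    return math.sqrt(sum_sq / (n - divisor_offset))

items = [3, 6, 9, 12, 15]                   # the m column of Table 33
sigma = standard_deviation(items)           # descriptive form, divisor N
estimate = standard_deviation(items, 1)     # sample estimate, divisor N - 1
print(round(sigma, 2), round(estimate, 2))
```

With only five items the two divisors give noticeably different results; with a large number of cases the difference becomes slight.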



When the items are grouped in a frequency distribution 
the task of computation is a little more complicated. The 
measurement of deviations from an arbitrary origin is essen- 
tial in this case, as it greatly simplifies the calculations. 

¹ This formula is used in statistical description, which is the concern of this
section of the book. If our purpose is to use results secured from a sample as
estimates of the attributes of the population from which the sample has been
drawn, a slight modification is desirable. It has been shown that the estimate
of the true standard deviation is improved if N - 1 be used as the divisor in the
formula, in place of N. The difference is slight for estimates based on large
samples, important for small ones.

The general formula for the standard deviation is

$$ \sigma = \sqrt{\frac{\sum f d^2}{N}} $$

where f represents the class-frequencies, d the deviations
from the arithmetic mean and N the number of cases
included. It follows that

$$ \sigma^2 = \frac{\sum f d^2}{N}. $$

If a deviation from an arbitrary origin be represented by
d' and the root-mean-square deviation from this origin be
represented by s, we have

$$ s^2 = \frac{\sum f (d')^2}{N}. $$

The root-mean-square deviation from the mean (σ) is less
than the root-mean-square deviation from any other point
on the scale. Hence s² is greater than σ². We may represent
by c the difference between the true mean and the
arbitrary origin. It may be readily established ¹ that

$$ \sigma^2 = s^2 - c^2. $$

The value of the standard deviation may be most easily
determined, therefore, by computing s² and c². The operations
involved are illustrated in detail in Table 34, showing
the distribution of 11,404 steel workers, classified on the
basis of average hourly earnings in 1933.


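¹ For

$$ s^2 = \frac{\sum f (d')^2}{N}, \qquad \sigma^2 = \frac{\sum f d^2}{N}, \qquad d' = d + c, $$

$$ (d')^2 = d^2 + 2cd + c^2, \qquad \sum f (d')^2 = \sum f d^2 + 2c \sum f d + N c^2, $$

but $\sum f d = 0$, so that

$$ \sum f (d')^2 = \sum f d^2 + N c^2, \qquad \frac{\sum f (d')^2}{N} = \frac{\sum f d^2}{N} + c^2, \qquad s^2 = \sigma^2 + c^2. $$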



Table 34

Computation of the Standard Deviation
(Distribution of 11,404 steel workers, classified by average hourly earnings in 1933)

    (1)         (2)     (3)   (4)       (5)       (6)      (7)       (8)
Average hourly  Mid-    Fre-  Deviation from
earnings        point  quency arbitrary origin
(cents)        (cents)        (class-interval units)
                  m       f    d'       fd'    f(d')²  (d'+1)²  f(d'+1)²

 15.0- 19.9    17.5      41    -9      -369     3,321       64     2,624
 20.0- 24.9    22.5      54    -8      -432     3,456       49     2,646
 25.0- 29.9    27.5     342    -7    -2,394    16,758       36    12,312
 30.0- 34.9    32.5   1,158    -6    -6,948    41,688       25    28,950
 35.0- 39.9    37.5   2,103    -5   -10,515    52,575       16    33,648
 40.0- 44.9    42.5   2,063    -4    -8,252    33,008        9    18,567
 45.0- 49.9    47.5   1,433    -3    -4,299    12,897        4     5,732
 50.0- 54.9    52.5   1,131    -2    -2,262     4,524        1     1,131
 55.0- 59.9    57.5     775    -1      -775       775        0         0
 60.0- 64.9    62.5     478     0         0         0        1       478
 65.0- 69.9    67.5     457     1       457       457        4     1,828
 70.0- 74.9    72.5     304     2       608     1,216        9     2,736
 75.0- 79.9    77.5     216     3       648     1,944       16     3,456
 80.0- 84.9    82.5     193     4       772     3,088       25     4,825
 85.0- 89.9    87.5     117     5       585     2,925       36     4,212
 90.0- 94.9    92.5     111     6       666     3,996       49     5,439
 95.0- 99.9    97.5      62     7       434     3,038       64     3,968
100.0-104.9   102.5      71     8       568     4,544       81     5,751
105.0-109.9   107.5     103     9       927     8,343      100    10,300
110.0-114.9   112.5      34    10       340     3,400      121     4,114
115.0-119.9   117.5      58    11       638     7,018      144     8,352
120.0-124.9   122.5      27    12       324     3,888      169     4,563
125.0-129.9   127.5      19    13       247     3,211      196     3,724
130.0-134.9   132.5      19    14       266     3,724      225     4,275
135.0-139.9   137.5      14    15       210     3,150      256     3,584
140.0-144.9   142.5      12    16       192     3,072      289     3,468
145.0-149.9   147.5       2    17        34       578      324       648
150.0-154.9   152.5       4    18        72     1,296      361     1,444
155.0-159.9   157.5       2    19        38       722      400       800
160.0-164.9   162.5       1    20        20       400      441       441

                     11,404         -28,200   229,012            184,016

N = 11,404
Class-interval = 5.0 cents

c (in class-interval units) = -28,200 / 11,404 = - 2.4728
c² (in class-interval units) = 6.1147
s² (in class-interval units) = 229,012 / 11,404 = 20.0817
σ² (in class-interval units) = 20.0817 - 6.1147 = 13.9670
σ (in class-interval units) = 3.737
σ (in original units) = 3.737 × 5.0 cents = 18.685 cents

The entire calculation, it will be noted, is carried through
in terms of class-interval units, the result being reduced to
the original units in the final operation. In computing c,
the difference between the true mean and the arbitrary
origin, the algebraic sum of the deviations is divided by the
number of cases. The arithmetic mean could be determined
by reducing c to original units and adding this value
(algebraically) to the value of the arbitrary quantity selected
as origin, but this is not an essential step. The actual
value of the mean need not be known in the computation of
the standard deviation.
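The calculation of Table 34 can be sketched in Python as follows. This is an illustrative sketch only; the function name and its parameters are hypothetical. The frequency list is an assumption to the extent that entries illegible in the printed table have been restored from the fd' column and from the stated total of 11,404 cases.

```python
import math

def grouped_sd(freqs, origin_index, class_interval):
    """Standard deviation of a frequency distribution by the short method.
    Deviations d' are counted in class-interval units from an arbitrary
    origin, the mid-point of the class at position origin_index."""
    n = sum(freqs)
    devs = [k - origin_index for k in range(len(freqs))]
    c = sum(f * d for f, d in zip(freqs, devs)) / n        # mean of d'
    s2 = sum(f * d * d for f, d in zip(freqs, devs)) / n   # mean of (d')^2
    sigma_units = math.sqrt(s2 - c * c)                    # sigma^2 = s^2 - c^2
    return c, s2, sigma_units * class_interval             # back to cents

# Frequencies of Table 34, lowest class first (restored values assumed).
freqs = [41, 54, 342, 1158, 2103, 2063, 1433, 1131, 775, 478,
         457, 304, 216, 193, 117, 111, 62, 71, 103, 34,
         58, 27, 19, 19, 14, 12, 2, 4, 2, 1]

c, s2, sigma = grouped_sd(freqs, origin_index=9, class_interval=5.0)
mean = 62.5 + c * 5.0    # optional: the mean is not needed to obtain sigma
print(round(c, 4), round(s2, 4), round(sigma, 2), round(mean, 3))
```

Carried to full precision the result differs slightly from 18.685 cents, which rests on the value 3.737 rounded to three decimal places.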

A check upon the accuracy of the calculations (the Charlier 
check *) is afforded by the figures in cols. (7) and (8) of 
Table 34. If deviations be measured, not from the arbitrary 
origin employed in computing the standard deviation, but 
from an origin one class-interval below, we secure a set of 
values equal to d' + 1. The squares of these values are 
given in col. (7). Multiplying by the corresponding fre- 
quencies we have the quantities recorded in col. (8), the 
sum of which is 184,016. This total stands in a definite 
relationship to the values secured in computing the standard
deviation. For

$$ \sum f (d' + 1)^2 = \sum f \left[ (d')^2 + 2d' + 1 \right] = \sum f (d')^2 + 2 \sum f d' + \sum f $$

or

$$ \sum f (d' + 1)^2 = \sum f (d')^2 + 2 \sum f d' + N. $$

Inserting in this last equation the values secured from 
the calculations shown in Table 34, we obtain this check: 

184,016 = 229,012 + 2(- 28,200) + 11,404 
= 184,016. 
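The check lends itself to a short mechanical verification. The sketch below is an illustration under one stated assumption: the frequency list is that of Table 34, with entries illegible in the printed table restored from the fd' column and the total N = 11,404.

```python
# Frequencies f for the 30 classes of Table 34, lowest class first
# (restored values assumed, as noted above).
freqs = [41, 54, 342, 1158, 2103, 2063, 1433, 1131, 775, 478,
         457, 304, 216, 193, 117, 111, 62, 71, 103, 34,
         58, 27, 19, 19, 14, 12, 2, 4, 2, 1]
devs = range(-9, 21)     # d' in class-interval units, origin at 62.5 cents

n = sum(freqs)
sum_fd = sum(f * d for f, d in zip(freqs, devs))
sum_fd2 = sum(f * d * d for f, d in zip(freqs, devs))
sum_shifted = sum(f * (d + 1) ** 2 for f, d in zip(freqs, devs))

# Charlier check: the two sides must agree exactly.
check_ok = sum_shifted == sum_fd2 + 2 * sum_fd + n
print(n, sum_fd, sum_fd2, sum_shifted, check_ok)
```

Because every quantity is a whole number, the comparison is exact, and any slip in a single product shows up as a failed check.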

The following is a summary of the steps in the process of 
computing the standard deviation of items grouped in a 
frequency distribution: 

* Cf. C. V. L. Charlier, Vorlesungen über die Grundzüge der mathematischen
Statistik, Lund, Verlag Scientia, 1920, 19.

1. Select as arbitrary origin the mid-point of one of the
class-intervals.

2. Measure the deviations from this point of the items in each
class, in class-interval units. Multiply these deviations by the
corresponding class-frequencies.

3. Divide the algebraic sum of these products by N. This gives c
in class-interval units.

4. Square the deviations and multiply the squares by the
corresponding class-frequencies.

5. Divide the sum of the squared deviations by N. This gives s²
in class-interval units.

6. From the formula σ² = s² - c², compute σ². Extract the
square root of this value, securing σ in class-interval units.

7. Multiply σ, as thus computed, by the class-interval. The result
is σ in the original units of measurement.

Certain of the characteristics of the standard deviation
and its relation to other measures of dispersion are described
in a later section.

calculation and immediate significance this quartile deviation
has distinct merits. 

Within the range between the two quartiles, of course, 
one half of all the measures are included. The greater the 
concentration the smaller this interval, hence a fairly accu- 
rate measure of dispersion may be obtained from the rela- 
tionship between these two quartiles. The quartile deviation 
is the semi-interquartile range, half the distance along the 
scale between the first and third quartiles. Thus if Q.D.
represent the quartile deviation, Q₁ the first quartile and
Q₃ the third quartile,

$$ \text{Q.D.} = \frac{Q_3 - Q_1}{2}. $$

If the value of a point on the scale half-way between 
the first and third quartiles is represented by K, one half 
of all the measures in a frequency distribution will fall 
within the range K ± Q.D. For the data in Table 32, 
relating to the hourly earnings in 1933 of steel workers in 
the Great Lakes and Middle West District, we have 

Q₁ = 39.07

Q₃ = 59.03

Q.D. = (59.03 - 39.07) / 2
= 9.98

K = 39.07 + 9.98
= 49.05.

Thus one half of all the measures lie within the range 
49.05 ± 9.98. This statement, together with a statement
of the average hourly earnings in 1933 (mean, median, or 
mode), constitutes a useful description of the distribution. 
In a perfectly symmetrical distribution the value of K will 
coincide with the value of the median (that is, the median 
will lie half-way along the scale from Q₁ to Q₃). The distribution
of wage rates is slightly asymmetrical, the value of
the median being 48.11, as compared with the value of
49.05 for K.
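The arithmetic is easily mechanized. The following sketch, in which the function name is hypothetical, reproduces the quartile deviation and the mid-point K from the two quartiles of Table 32 and shows the gap between K and the median.

```python
def quartile_deviation(q1, q3):
    """Semi-interquartile range and the mid-point K between the quartiles."""
    qd = (q3 - q1) / 2
    k = q1 + qd
    return qd, k

qd, k = quartile_deviation(q1=39.07, q3=59.03)   # quartiles from Table 32
median = 48.11
gap = k - median     # zero in a perfectly symmetrical distribution
print(round(qd, 2), round(k, 2), round(gap, 2))
```

A positive gap, as here, indicates that the upper quartile lies farther from the median than the lower one.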

THE PROBABLE ERROR 

In studying the results of astronomical and other physi- 
cal measurements it has been found that the values secured 
by different observers for the same constant quantity vary. 
These varying results, however, are distributed in a certain 
definite way, and when plotted give a curve similar to the 
normal curve of error. In such cases there is an immediate
and obvious need of some measure of variation which may
be used as an index of the reliability of given results. If 
the results secured by different investigators, or by the 
same investigator at different times, vary widely they can- 
not be accepted as reliable, while the reverse is true if the 
variation is slight. The measure of dispersion which has 
been generally employed in such cases is termed the prob- 
able error. The probable error is that amount which, in a 
given case, is exceeded by the errors of one half the ob- 
servations. Since the most probable value of a given series 
of observations is their arithmetic mean, the probable error 
is always measured from the mean. The name of this 
measure derives from the fact that the probability that a 
given observation will vary from the mean of all the ob- 
servations by an amount greater than the probable error 
is exactly 1/2. It follows that, when the observations are
arranged in the form of a frequency distribution, a distance 
equal to the probable error laid off on each side of the arith- 
metic mean will define limits within which one half of the 
total number of cases will fall. 

This measure of variation has been employed in fields 
other than that in which it was originally applied, fields in 
which the name probable error is somewhat misleading. In 
such cases it is perhaps better to think of it as the probable 
deviation, that distance from the mean which will be exceeded
by one half of the total deviations.

The probable error is a measure of dispersion which is
fully significant only when it applies to a distribution following
the normal law of error. In such cases it has a
definite and precise meaning. This is not so when it is
applied to skew distributions, and its use in such cases
is not advisable. The quartile deviation, the value of which
is equal to that of the probable error in a normal distribu- 
tion, has a more direct significance than the probable error 
in the description of abnormal distributions, and should be 
employed in such cases. In a later section the use of the 
probable error as a measure of the reliability of statistical 
results is more fully explained. 

The value of the probable error in a given case, assuming 
a normal distribution to prevail, may be determined from 
the value of the standard deviation, for there is a constant 
relationship between these two. This is expressed by the
formula: P.E. = 0.6745σ.
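The constant 0.6745 can be recovered from the normal curve itself. The sketch below is illustrative only; it relies on the `NormalDist` class of Python's standard `statistics` module, and the standard deviation of 10 units is a hypothetical value, not one drawn from the text.

```python
from statistics import NormalDist

# The point of the standard normal curve below which three quarters of
# the area lies; one half of the area then lies within +/- that point.
factor = NormalDist().inv_cdf(0.75)

def probable_error(sigma):
    """Probable error of a normally distributed quantity."""
    return factor * sigma

sigma = 10.0                                   # hypothetical value
pe = probable_error(sigma)
curve = NormalDist(mu=0.0, sigma=sigma)
inside = curve.cdf(pe) - curve.cdf(-pe)        # share within mean +/- P.E.
print(round(factor, 4), round(pe, 3), round(inside, 3))
```

The share of cases falling within one probable error of the mean comes out at one half, as the definition requires.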

Relations between Different Measures of 
Variation 

An understanding of the significance of the various meas- 
ures of dispersion described above may be facilitated by a 
general comparison and a summary statement of the rela- 
tions holding between them. 

1. The range is a distance along the scale within which all the
observations lie.

2. The quartile deviation or semi-interquartile range is a distance
along the scale which, when laid off on each side of the point
midway between the two quartiles, includes one half the total
number of observations.

3. The mean deviation from the mean, in a normal or slightly
skew distribution, is equal to about 4/5 of the standard deviation.
A range of 7½ times the mean deviation, centering at the
mean, will include approximately 99 per cent of all the cases.

4. When a distance equal to the standard deviation is laid off on
each side of the mean, in a normal or only slightly skew distribution,
about two thirds of all the cases will be included.
(In the normal distribution exactly 68.26 per cent of the observations
will be included.) When a distance equal to twice the
standard deviation is laid off on each side of the mean approximately
95 per cent of the cases will be included (exactly 95.46
per cent in a normal distribution). When a distance equal to
three times the standard deviation is laid off on each side of
the mean about 99 per cent of all the observations will be
included (exactly 99.73 per cent in a normal distribution).
This general rule that a range of six times the standard deviation,
centering at the mean, will include about 99 per cent of
all the measures furnishes a useful check upon calculations.

A study of Fig. 45 may help to make clear the significance of
the standard deviation in a normal distribution.

5. The probable error, in a normal distribution, is equal to 0.6745σ.
A range of twice the probable error, centering at the mean,
will include 50 per cent of all the observations. A range of
eight times the probable error, centering at the mean, will
include approximately 99 per cent of all the observations.
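These relations may be checked numerically for the normal curve. The sketch below is an illustration, not part of the original exposition; it uses `NormalDist` from Python's standard `statistics` module, and the helper name is hypothetical. The ratio of the mean deviation to the standard deviation in a normal distribution is taken as the square root of 2/π, a standard result.

```python
import math
from statistics import NormalDist

z = NormalDist()                       # standard normal curve, sigma = 1

def share_within(k):
    """Fraction of a normal distribution within k sigma of the mean."""
    return z.cdf(k) - z.cdf(-k)

md_ratio = math.sqrt(2 / math.pi)      # mean deviation / sigma
pe_ratio = z.inv_cdf(0.75)             # probable error / sigma

for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 2))
print(round(md_ratio, 4), round(pe_ratio, 4))
print(share_within(3.75 * md_ratio) > 0.99)   # range of 7 1/2 mean deviations
print(share_within(4 * pe_ratio) > 0.99)      # range of 8 probable errors
```

Computed to two decimal places the first two percentages may differ in the last figure from those quoted above, a difference attributable to the rounding of tabulated areas.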

Characteristic Features of the Chief Measures
of Variation

The range 

1. The range is easily calculated and its significance is readily
understood. As a rough measure of the degree of variation
the range is useful.

2. The value of the range is determined by the values of the two
extreme cases. It is thus a highly unstable measure, the
value of which may be greatly changed by the addition or
withdrawal of a single figure.

3. This measure gives no indication of the character of the
distribution within the two extreme observations.

The quartile deviation 

1. The quartile deviation is a measure of dispersion that is easily
computed and readily understood. It is superior to the range
as a rough measure of variation.

2. The quartile deviation is not a measure of the variation from
any specific average.

3. This measure is not affected by the distribution of the items
between the first and third quartiles, or by the distribution
outside the quartiles. The values of the quartile deviation
might be the same for two quite dissimilar distributions,
provided the quartiles happened to coincide. Because it is not
affected by the deviations of individual items it cannot be
accepted as an accurate measure of variation.

4. The quartile deviation is not suited to algebraic treatment.

The mean deviation 

1. The mean deviation is affected by the value of every observation.
As the average difference between the individual items
and the median (or mean) of the distribution it has a precise
significance.

2. The mean deviation is less affected by extreme deviations than
the standard deviation.

3. Mathematically, the mean deviation is not as logical or as
convenient a measure of dispersion as the standard deviation.

The standard deviation 

1. The standard deviation is affected by the value of every
observation.

2. The process of squaring the deviations before adding avoids the
algebraic fallacy of disregarding signs.

3. The standard deviation has a definite mathematical meaning
and is perfectly adapted to algebraic treatment.

4. The standard deviation is, in general, less affected by
fluctuations of sampling than the other measures of dispersion.

5. The normal curve of error has been analyzed in terms of the
standard deviation. The information thus obtained has
increased greatly the utility of the standard deviation.

The probable error 

1. The probable error has a definite meaning in the case of a
distribution following the normal law. It has not this precise
meaning for other distributions, and should not be employed
in describing them.

2. For distributions to which it is adapted, the probable error is an
extremely useful measure. Its most important use is as an
index of the magnitude of errors of sampling.

3. The definite relationship between the probable error and the
standard deviation, for a normal distribution, permits the
value of the probable error to be readily determined.

All the measures of variation described above may be
utilized for particular purposes. The standard deviation,
however, is the best general measure and should be employed
in all cases where a high degree of accuracy is required.
The probable error is, in effect, merely a fractional
part of the standard deviation, with a definite but restricted
field of usefulness.

The Measurement of Relative Variation

We have been dealing in the preceding section with 
absolute variability. The various measures of dispersion 
secured by the methods outlined describe the variability 
of the data in terms of absolute units of measurement. 
The standard deviation of London-Paris exchange rates is 
in francs, the standard deviation of pig iron production in 
tons, etc. If the object in a given case is the description of 
a single frequency distribution it is desirable that the orig- 
inal unit be employed throughout, but if measures of varia- 
tion of two different distributions are to be compared, diffi- 
culties are encountered. This is clear if the units are unlike, 
but even if the units are identical the same difficulty arises. 
Thus measures of variation in the weights of dogs and in the 
weights of horses might both have been computed in pounds. 
Because the standard deviation of horse weights is greater 
than the standard deviation of dog weights, it does not fol- 
low that the degree of variability is greater in the former 
case. A measure of absolute variation is significant only in 
relation to the average from which the deviations are meas- 
ured. Its use, apart from this average, is meaningless. For 
comparison, therefore, it must be reduced to a relative form, 
and the obvious procedure is to express a given measure 
of variation as a percentage of the average from which the 
deviations have been measured. The quantity thus becomes 
an abstract number, a measure of the relative variability 
of the given observations, and may be compared with similar
terms computed from other distributions. 

THE COEFFICIENT OF VARIATION

The measure of relative variation most commonly employed
is that developed by Pearson, termed the coefficient
of variation, and represented by the letter V. It is simply
the standard deviation as a percentage of the arithmetic
mean. Thus

$$ V = \frac{\sigma}{M} \times 100. $$

Applying this formula to the results secured from the analysis
of the distribution of steel workers, classified according
to hourly earnings in 1933 (Table 34), we have

$$ V = \frac{18.685}{50.136} \times 100 = 37.27\%. $$

This measurement may be compared with a similar coeffi- 
cient relating to the distribution of workers in open-hearth 
furnaces, classified according to average hourly earnings in 
1935. In that year the mean wage was 71.946 cents and the
standard deviation 28.55 cents. From these

$$ V = \frac{28.55}{71.946} \times 100 = 39.68\%. $$


Variation in hourly earnings among steel workers was
greater in 1935 than in 1933. The difference was not as 
great, however, as a comparison of standard deviations would 
indicate. The average wage advanced appreciably between 
1933 and 1935 and the relative variation increased only 
moderately. 
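The comparison may be sketched as follows. The function name is hypothetical; the figures are those quoted above for 1933 and 1935.

```python
def coefficient_of_variation(sigma, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sigma / mean * 100

v_1933 = coefficient_of_variation(18.685, 50.136)
v_1935 = coefficient_of_variation(28.55, 71.946)

sd_ratio = 28.55 / 18.685      # growth in absolute variation
v_ratio = v_1935 / v_1933      # growth in relative variation
print(round(v_1933, 2), round(v_1935, 2), round(sd_ratio, 2), round(v_ratio, 2))
```

The standard deviation rose by roughly one half between the two years, while the coefficient of variation rose by only about six per cent, which is the point of the comparison.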

An index of variability similar to this coefficient might 
be secured by expressing any of the other measures of 
deviation as a percentage of the average from which the 
deviations were computed. Pearson’s coefficient has been 
generally adopted, however, and is the only one in wide use. 

Measures of Skewness

Methods have been developed in the preceding sections 
for describing the central tendency of a frequency distribution
and for measuring the degree of concentration or
lack of concentration about that central tendency. One 
further measure is needed, and that is one which indicates 
the degree of skewness or asymmetry of a given distri- 
bution. For it is essential to know, in regard to a given 
distribution, whether the observations are arranged sym- 
metrically about the central value, or are dispersed in an 
uneven, asymmetrical fashion about that value. Having 
such a figure it will be possible effectively to summarize 
the characteristics of a frequency distribution in three sim- 
ple terms — an average, a measure of dispersion and a 
measure of skewness. There are two measures of skewness 
in current use. 

If a frequency curve is perfectly symmetrical, mean, 
median, and mode will coincide. As the distribution de- 
parts from symmetry these three values are pulled apart, 
the difference between the mean and the mode being great- 
est. This difference may be used, therefore, as a measure 
of skewness. It is desirable in this case, as in measuring 
relative variability, to secure an index in the form of an 
abstract number, which may be compared with similar fig- 
ures derived from other distributions. To this end, Pearson 
has proposed dividing the absolute difference between mean 
and mode by the standard deviation of the given distribu- 
tion. His formula is 

$$ sk \ (\text{skewness}) = \frac{M - Mo}{\sigma}. $$

In a symmetrical distribution, where mean and mode coin- 
cide, the value of this measure will be zero. Under other 
conditions the value may be positive or negative, depending 
upon the relative positions of the two averages on the scale. 

For moderately skew distributions the degree of skewness
may be computed more readily from the formula

$$ sk = \frac{3(M - Md)}{\sigma}. $$

This corresponds approximately to the other formula, because
of the fact that in a moderately asymmetrical
distribution the median lies between the mean and the mode,
about one third of the distance from the former towards 
the latter. 
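The correspondence between the two forms can be seen in a small sketch. The function names are hypothetical, and so are the figures: they describe an imaginary distribution in which the median lies exactly one third of the way from the mean toward the mode, so that the two measures agree exactly.

```python
def pearson_skewness(mean, mode, sigma):
    """Pearson's measure of skewness: (mean - mode) / sigma."""
    return (mean - mode) / sigma

def pearson_skewness_from_median(mean, median, sigma):
    """Form for moderately skew distributions: 3 (mean - median) / sigma."""
    return 3 * (mean - median) / sigma

# Hypothetical values, chosen so that mean - median = (mean - mode) / 3.
mean, median, mode, sigma = 50.0, 48.0, 44.0, 12.0
sk_mode = pearson_skewness(mean, mode, sigma)
sk_median = pearson_skewness_from_median(mean, median, sigma)
print(sk_mode, sk_median)
```

In actual data the one-third relation holds only approximately, and the two results will differ somewhat.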

Because it is difficult to locate the mode by simple meth- 
ods, a measure of skewness more easily computed than 
Pearson’s is desirable in some cases. Bowley has proposed 
such a method, based upon the relationship between the 
first and third quartiles and the median. If the distribution 
is symmetrical these two quartiles will be equidistant from 
the median; with an asymmetrical distribution this is not 
so. Therefore, if we let q₂ represent the difference between
the upper quartile and the median and q₁ represent the
difference between the median and the lower quartile, we
may use the formula

$$ sk = \frac{q_2 - q_1}{q_2 + q_1} $$

as a means of securing a measure of skewness. This value
will vary between 0 and ± 1. For with perfect symmetry
q₂ = q₁, and the measure is 0; with asymmetry so pronounced
that the median and one of the quartiles coincide,
either q₂ or q₁ becomes equal to 0, and the formula gives
a value of - 1 or + 1. Bowley suggests that a value of .1
indicates a moderate degree of skewness, while a value of
.3 indicates marked skewness.

The values secured from this measure are not, of course, 
comparable with the values secured from the application 
of Pearson’s formula for measuring skewness. 
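As an illustration, the quartile measure may be applied to the figures quoted earlier for Table 32 (first quartile 39.07, median 48.11, third quartile 59.03). The function name is hypothetical, and the sketch is offered only as a worked instance of the formula.

```python
def bowley_skewness(q1, median, q3):
    """Quartile measure of skewness: (q_2 - q_1) / (q_2 + q_1)."""
    upper = q3 - median      # q_2: upper quartile minus median
    lower = median - q1      # q_1: median minus lower quartile
    return (upper - lower) / (upper + lower)

sk = bowley_skewness(q1=39.07, median=48.11, q3=59.03)
print(round(sk, 3))
```

The result, a little under .1, would indicate by Bowley's rule of thumb a skewness that is at most moderate, in keeping with the description of the wage distribution as slightly asymmetrical.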

KURTOSIS

Reference has been made to a fourth measurable char- 
acteristic of frequency curves. This is the degree of flat- 
toppedness, as compared with the normal curve. A measure 
of kurtosis, the technical term for this characteristic, is 
given in Chapter XIII.


CHAPTER VI


INDEX NUMBERS OF PRICES 
The Nature of Index Numbers 

The term “index number” has been applied to a number 
of somewhat similar devices employed in the analysis of 
statistical series. Index numbers have been most widely 
used in the study of price changes, but a brief considera- 
tion of certain other uses may make clear the essential 
characteristics of such measures. In its simplest form this
name is applied to a term in a time series expressed as a 
relative number. Thus an index number of cotton consump- 
tion in the United States might take the following form: 

Table 35

Domestic Cotton Consumption in the United States,
1926-1936

(Consumption in year ended July 31, 1926 = 100)

Year ended    Cotton consumption      Cotton consumption
July 31       (unit: one thousand     relative
              running bales)

1926                6,456                   100
1927                7,190                   111
1928                6,834                   106
1929                7,091                   110
1930                6,106                    95
1931                5,263                    82
1932                4,866                    75
1933                6,137                    95
1934                5,700                    88
1935                5,361                    83
1936                6,351                    98
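The reduction of a series to relatives is a simple operation, sketched below for the cotton consumption figures of Table 35. The function name is hypothetical.

```python
def relatives(series, base_key):
    """Express each term of a series as a percentage of the base term,
    rounded to the nearest whole number."""
    base = series[base_key]
    return {key: round(value / base * 100) for key, value in series.items()}

# Cotton consumption in thousands of running bales (Table 35).
cotton = {1926: 6456, 1927: 7190, 1928: 6834, 1929: 7091, 1930: 6106,
          1931: 5263, 1932: 4866, 1933: 6137, 1934: 5700, 1935: 5361,
          1936: 6351}

cotton_rel = relatives(cotton, base_key=1926)
print(cotton_rel)
```

Any other year could serve as base by changing `base_key`; the relatives for all years would then shift in proportion.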


Similarly the price of a commodity may be expressed as
a relative, the price at a given date or for a given period
serving as base.

Table 36

Average Price of No. 1 Northern Spring Wheat,
1913, 1929-1936

(Average price in 1913 = 100)

Calendar      Weighted average      Relative
year          price per bushel      price

1913              $0.874               100
1929               1.276               146
1930               0.984               113
1931               0.739                85
1932               0.605                69
1933               0.770                88
1934               1.026               117
1935               1.165               133
1936               1.247               143


Reduction to relative form makes possible a ready
comparison of the values, and enables one
to follow the trend of the series more easily than
is possible with the original figures. When the data are
plotted, the comparison of the trends of different series
is facilitated.

Though the term index number is applied to such
relatives, it is better practice to reserve it for figures
which represent the combination of a number of series.
The series to be combined may relate to
consumption, wages, volume of production, or other
quantities subject to temporal variation. (Index numbers are used
also in measuring such geographical differences as arise
from variations in living costs from city to city or from
country to country.) Special problems are involved in the
construction of each of these special forms
of index numbers, but the object in each case is to
secure a single, simple figure representing the net resultants
of the changes occurring in the constituent elements.

A simple index number may be constructed to represent
the course of coal and petroleum production. In the making of
such an index it is necessary to
combine in some way production figures for bituminous
and anthracite coal and petroleum. The production figures
and the corresponding relatives for the three series, from
1922 to 1936, are given in Table 37.


Table 37

Production of Bituminous and Anthracite Coal and Petroleum
in the United States, 1922-1936

(Production in 1922 = 100)

        Prod. of           Prod. of           Prod. of
        bit. coal          anthr. coal        petrol.
Year    (million    Rel.   (million    Rel.   (million    Rel.
        sh. tons)          sh. tons)          bbls.)

1922      422.3    100       54.7    100      557.5    100
1923      564.6    134       93.3    171      732.4    131
1924      483.7    115       87.9    161      713.9    128
1925      520.1    123       61.8    113      763.7    137
1926      573.4    136       84.4    154      770.9    138
1927      517.8    123       80.1    146      901.1    162
1928      500.7    119       75.3    138      901.5    162
1929      535.0    127       73.8    135    1,007.3    181
1930      467.5    111       69.4    127      898.0    161
1931      382.1     90       59.6    109      851.1    153
1932      309.7     73       49.9     91      785.2    141
1933      333.6     79       49.5     90      905.7    162
1934      359.4     85       57.2    105      908.1    163
1935      372.4     88       52.2     95      996.6    179
1936      434.1    103       54.8    100    1,098.5    197


A rough index of fuel production, based upon these three 
series, is desired. It is impossible, obviously, to add the 
original figures, as the units are not the same. This difficulty
may be avoided by using the relative figures. A simple
average of the three relatives for a given year may serve 
as the required index. Index numbers thus secured are 
given in Table 38 on page 164. 
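The averaging just described may be sketched as follows, using the relatives of Table 37. The function name is hypothetical, and the rounding rule (halves rounded upward to the next whole number) is an assumption, though the choice does not affect any of these particular averages.

```python
# Relatives from Table 37 (1922 = 100): bituminous, anthracite, petroleum.
relatives = {
    1922: (100, 100, 100), 1923: (134, 171, 131), 1924: (115, 161, 128),
    1925: (123, 113, 137), 1926: (136, 154, 138), 1927: (123, 146, 162),
    1928: (119, 138, 162), 1929: (127, 135, 181), 1930: (111, 127, 161),
    1931: (90, 109, 153),  1932: (73, 91, 141),   1933: (79, 90, 162),
    1934: (85, 105, 163),  1935: (88, 95, 179),   1936: (103, 100, 197),
}

def simple_index(rels):
    """Simple average of relatives, rounded half up to a whole number
    (integer arithmetic avoids floating-point rounding surprises)."""
    total, n = sum(rels), len(rels)
    return (2 * total + n) // (2 * n)

index = {year: simple_index(rels) for year, rels in relatives.items()}
print(index)
```

Each series enters the average on the same footing, whatever its economic importance.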

Table 38

Index Numbers of Coal and Petroleum Production in the
United States, 1922-1936
(Production in 1922 = 100)

Year    Index        Year    Index

1922     100         1930     133
1923     145         1931     117
1924     135         1932     102
1925     124         1933     110
1926     143         1934     118
1927     144         1935     121
1928     140         1936     133
1929     148

In securing this index, by adding the three relative figures
for a given year and dividing by three, equal weight
has been given to each of the three series. Such an index of
equally weighted relatives has been termed an unweighted
index, but the term is misleading. Weights are used, the 
weights in this case being equal. It is clear that this index 
based upon equal weights does not reflect faithfully the 
three series combined in the present instance. For the three 
series are not of equal importance, as the system of equal 
weights assumes. The following figures showing the wholesale
values in exchange in 1926 of bituminous coal, anthracite
coal, and crude petroleum indicate the relative importance
of the three series: *

Mineral               Wholesale value in
                      exchange in 1926

Bituminous coal         $2,157,740,000
Anthracite coal            888,141,000
Petroleum                1,355,989,000

Roughly, these stand to one another in the relation of
5, 2, and 3, and these weights may be assigned to the series
under consideration. An index for each year may be computed,
using these weights. The example in Table 39,
showing the calculations for the years 1922 and 1923, will
illustrate the method.

* The figures have been compiled by the U. S. Bureau of Labor Statistics.

Table 39

Computation of Weighted Index Numbers of Coal
and Petroleum Production

                   Relative                        Relative
Mineral            production   Wt.   Wt. × Rel.   production   Wt.   Wt. × Rel.
                   1922                            1923

Bituminous coal       100        5        500         134        5        670
Anthracite coal       100        2        200         171        2        342
Petroleum             100        3        300         131        3        393
                                --      -----                   --      -----
                                10      1,000                   10      1,405

Index of fuel production, 1922 = 1,000 ÷ 10 = 100
Index of fuel production, 1923 = 1,405 ÷ 10 = 141
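The weighted computation of Table 39 can be extended to every year by a short routine. The sketch below is illustrative; the function name is hypothetical, and halves are assumed to be rounded upward, as the 1923 figure (1,405 ÷ 10 = 140.5, given as 141) suggests.

```python
WEIGHTS = (5, 2, 3)    # bituminous coal, anthracite coal, petroleum

# Relatives from Table 37 (1922 = 100), in the same order as WEIGHTS.
relatives = {
    1922: (100, 100, 100), 1923: (134, 171, 131), 1924: (115, 161, 128),
    1925: (123, 113, 137), 1926: (136, 154, 138), 1927: (123, 146, 162),
    1928: (119, 138, 162), 1929: (127, 135, 181), 1930: (111, 127, 161),
    1931: (90, 109, 153),  1932: (73, 91, 141),   1933: (79, 90, 162),
    1934: (85, 105, 163),  1935: (88, 95, 179),   1936: (103, 100, 197),
}

def weighted_index(rels, weights=WEIGHTS):
    """Weighted average of relatives, rounded half up to a whole number."""
    total = sum(w * r for w, r in zip(weights, rels))
    wsum = sum(weights)
    return (2 * total + wsum) // (2 * wsum)

weighted = {year: weighted_index(rels) for year, rels in relatives.items()}
print(weighted)
```

Python's built-in `round` would give 140 for 140.5, since it rounds halves to the nearest even number; the integer formula is used here to avoid that.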


The value of the index thus secured for each of the fif- 
teen years covered is shown in Table 40. 


Table 40

Weighted Index Numbers of Coal and Petroleum
Production in the United States, 1922-1936

Year    Index        Year    Index

1922     100         1930     129
1923     141         1931     113
1924     128         1932      97
1925     125         1933     106
1926     140         1934     112
1927     139         1935     117
1928     136         1936     131
1929     145


Differences between the two series of index numbers are 
to be expected. The second series, which is the more log- 
ically weighted, is, of course, the more accurate of the two, 
and gives a more faithful representation of the combined 
effect of the forces affecting the output of coal and petroleum. 

Another type of index number is one in which the items
in the constituent series are totaled, the aggregate figure,
instead of an average, serving as the representative of the
entire group. Such a form of index number may be constructed
only when the different series are all expressed in
the same unit. This form is frequently employed as an
indication of changes in the level of prices, the aggregate
cost of a bill of goods at one period being compared with
the aggregate cost of the same goods at other dates. The
figures in Table 41 illustrate this type of index.

Table 41

Bradstreet's Index of Wholesale Prices in the
United States, 1926-1937 ¹

    Year    Index        Year    Index

    1926    13.02        1932     7.10
    1927    12.78        1933     7.86
    1928    13.28        1934     9.22
    1929    12.67        1935     9.92
    1930    10.75        1936    10.10
    1931     8.76        1937    11.06


Each of the yearly aggregates quoted above is the sum 
of the average prices during the year of 96 commodities at 
wholesale. Before being added all the prices are reduced to 
the “per pound” basis, so that a certain degree of compara- 
bility is secured. Such an index may be readily changed to 
the relative form, any year being taken as a base and the 
totals for the other years expressed as percentages of the
figure for the base year. 
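
The conversion to relative form described above can be sketched as follows, using the aggregates of Table 41. The function name `to_relatives` and the choice of 1926 as base are assumptions made for illustration; any other year could serve as base.

```python
def to_relatives(aggregates, base_year):
    """Express each yearly aggregate as a percentage of the base-year aggregate."""
    base = aggregates[base_year]
    return {year: round(100 * value / base, 1) for year, value in aggregates.items()}


# Yearly aggregates from Table 41 (Bradstreet's index).
bradstreet = {
    1926: 13.02, 1927: 12.78, 1928: 13.28, 1929: 12.67,
    1930: 10.75, 1931: 8.76, 1932: 7.10, 1933: 7.86,
    1934: 9.22, 1935: 9.92, 1936: 10.10, 1937: 11.06,
}

relatives_1926_base = to_relatives(bradstreet, 1926)
for year, relative in relatives_1926_base.items():
    print(year, relative)
```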

The examples which have been given will indicate some 
of the many forms which index numbers may take. The 
term may refer to a simple relative number; it may be 
applied to an average of relative terms, or to an aggregate 
of relative or absolute figures. In all the examples given 
the index has been designed to serve as a measure of change 
over a period, as an indicator of changes in the values of 
time series. The term may have a much broader meaning 
than this. An index of the ability of salesmen might be
constructed by giving numerical values to the factors determining
their usefulness and securing an average of these
values. An index of the efficiency of different departments
in a business enterprise might be constructed. In any case,
the construction of an index involves the reduction to comparable
terms of a number of different factors and the
replacement of these several terms by a single figure which
may serve as their representative. Comparison is thus
facilitated, whether it be comparison over time or space or
comparison with other indices secured by averaging terms
relating to a similar unit. In all its forms (except the first
limited and exceptional meaning in which it applies to a
simple relative) an index number is thus a type of statistical
average, and such numbers, in their construction and use,
are subject to all the rules and limitations set forth in the
development of the subject of averages.

¹ Construction of this index was discontinued at the end of 1937.

In the present work we are interested only in the applica- 
tion of the index number device to time series. So varied, 
however, are the rules and practices relating to its applica- 
tion to different types of time series that certain of these 
types must be treated separately. Our first concern is with 
index numbers of wholesale prices. 

Price Changes 

When price movements are surveyed in detail it is difficult
to perceive order, or any definite trend. We find a multiplicity
of conflicting movements. The price quotations
in Table 42, taken at random, are roughly
typical of what would be found were the entire field of prices
canvassed in order to compare price movements from month
to month.

Of the sixteen commodities listed, five showed no price 
change at all between October and November, 1937, two 
showed price increases, and in nine cases prices declined. 
Some of the price movements were inconsiderable, while 
some marked very material changes. Such, as seen here 
in miniature, is what happens in the price system as a whole. 
Table 42

Commodity Prices at Wholesale *

                                                       Price         Price
    Commodity                             Unit         (wholesale)   (wholesale)
                                                       October,      November,
                                                       1937          1937

    Brick, common building, average
      of yard prices                      1,000        $12.113       $12.113
    Pig iron, basic, Valley furnace       Gross ton     23.500        23.500
    Cement, Portland, average of
      plant prices                        Bbl.           1.667         1.667
    Linseed oil, raw, N. Y.               Pound           .110          .106
    Steel billets, rerolling, Pitts.      Gross ton     37.000        37.000
    Steel, scrap, Chicago                 Gross ton     14.688        12.500
    Copper, electrol., refinery           Pound           .119          .108
    Lead, pig, N. Y.                      Pound           .058          .051
    Zinc, pig, N. Y.                      Pound           .065          .060
    Coal, anthr., chestnut, average
      of 15 price series, on tracks,
      destination                         Net ton        9.472         9.610
    Coal, bit., mine run, average of
      27 price series, on tracks,
      destination                         Net ton        4.305         4.303
    Crude petroleum, Penn., at wells      Bbl.           2.413         2.350
    Gasoline, motor, California,
      refinery                            Gal.            .083          .085
    Cotton, middling, N. O.               Pound           .083          .080
    Wheat, no. 2 red winter, Chi.         Bu.            1.033          .951
    Sugar, granulated, N. Y.              Pound           .048          .048
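
The count of unchanged, rising, and falling prices given in the text can be checked directly against the quotations of Table 42. The sketch below is illustrative only; the dictionary keys are abbreviated commodity labels and the function name `tally_movements` is an assumption introduced here.

```python
# (October 1937 price, November 1937 price) for each commodity in Table 42.
prices = {
    "Brick": (12.113, 12.113),
    "Pig iron": (23.500, 23.500),
    "Cement": (1.667, 1.667),
    "Linseed oil": (0.110, 0.106),
    "Steel billets": (37.000, 37.000),
    "Steel scrap": (14.688, 12.500),
    "Copper": (0.119, 0.108),
    "Lead": (0.058, 0.051),
    "Zinc": (0.065, 0.060),
    "Anthracite coal": (9.472, 9.610),
    "Bituminous coal": (4.305, 4.303),
    "Crude petroleum": (2.413, 2.350),
    "Gasoline": (0.083, 0.085),
    "Cotton": (0.083, 0.080),
    "Wheat": (1.033, 0.951),
    "Sugar": (0.048, 0.048),
}


def tally_movements(price_pairs):
    """Count how many prices were unchanged, increased, or declined."""
    counts = {"unchanged": 0, "increased": 0, "declined": 0}
    for october, november in price_pairs.values():
        if november > october:
            counts["increased"] += 1
        elif november < october:
            counts["declined"] += 1
        else:
            counts["unchanged"] += 1
    return counts


movement_counts = tally_movements(prices)
print(movement_counts)
```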


All prices do not, with absolute uniformity, move up or
down or remain constant. Each of the thousands of commodities
traded in on the markets of any country, or of the
world, moves in its own individual way, subject to a variety
of influences. Yet it does not act in isolation. In its price
movements it affects other commodities, and is affected
by them. And, in addition to the forces peculiar to each
commodity, there are broad forces which act throughout
the price system, affecting all commodities. It is the business
of the economic statistician to bring order out of the
chaos of price movements taking place at any given time
and, out of the multiplicity of minor movements, to pick
the broad trends which affect the whole economic system.

* As compiled by the U. S. Bureau of Labor Statistics.

The forces bringing about the price movements that are 
to be studied are numerous and complicated, but some 
general conclusions may be drawn with regard to them. 
There are, in the first place, all those changes in production 
and consumption conditions peculiar to individual commodi- 
ties and affecting directly the prices of those commodities. 
The opening of new fields, improvements in production 
technique in individual cases, changes in fashion and the 
transfer of demand from some commodities to others, changes 
in demand and supply with the seasons — all these are 
causing constant price readjustments. These are the changes 
which in ordinary times are most obvious, which are brought 
home directly to the individual merchant or consumer. 
Such changes affect the whole price system, as has been 
pointed out, but not in general by causing upward or down- 
ward movements in the system as a whole. 

These general movements are due to forces that are 
broader in their scope. The general improvement in pro- 
duction technique and the increase in the productivity of 
human labor which has resulted have, by increasing the 
supply of commodities available for consumption, affected 
prices. Changes in monetary systems and, in particular, 
changes in the gold supply have exerted a direct and imme- 
diate influence upon prices, by affecting the supply of money 
in circulation. Similar in character have been changes in 
banking and credit systems and changes in commercial
practice that have affected the use of credit instruments
and the rapidity of circulation of money and credits. All
these forces influence prices, though their incidence is not
so specific as is that of the factors affecting individual commodities
directly.
Purpose of General Index Numbers of
Wholesale Prices

These separate forces cannot be isolated and evaluated. 
Their joint action causes a perplexing variety of price 
changes. In studying these changes the problem might be
approached from several different points of view. It might 
be desired to study the readjustments that take place within 
the price system, to determine the nature and degree of the 
shifts within the system that come with changing conditions. 
Such a study would yield valuable information as to the 
behavior of prices and the character of their interrelations. 
Our immediate problem, however, is the determination of 
the net resultant of all these forces. Do all price movements 
cancel each other so that while some prices move up and 
some down there is no net change? Or is there at a given 
time a preponderance of movements in one direction, causing 
the level of general prices to move upward or downward? 
If there is such a trend, what is it, and how may it be meas- 
ured? Are the statistical methods that have been explained 
in the earlier sections applicable to the solution of this 
problem? 

The first step in this study involves the answering of the 
last question asked. It has been brought out that methods 
of summarizing quantitative data have been developed, 
but that these methods are applicable only when certain 
conditions are fulfilled. An average, it was noted, has no 
significance unless it represents a distinct central tendency 
in a mass of homogeneous data. Moreover, the type of 
average to be employed depends upon the character of the 
distribution it is to represent. Until the distribution of 
the original data is studied no average or other statistical 
measure can be intelligently employed. We must first, 
then, determine what the raw materials of the problem are, 
and study the frequency distributions secured when these
raw materials are organized.

For the present a quite general purpose will be assumed,
the determination of the change in the level of general 
wholesale prices between two specific dates. This is equiva- 
lent, of course, to measuring the change in the purchasing 
power of money in wholesale markets. The raw materials 
of the problem consist of a number of price quotations on 
individual commodities, quotations being secured for the
two dates to be compared. Each pair of quotations meas- 
ures the change in the price of a single commodity, a change 
caused by the interplay of many forces. When a great 
many such price quotations are brought together we have 
a mass of data representing the interaction of a multitude 
of forces, some individual and specific in their incidence, 
some general, affecting the prices of large groups of com- 
modities or of all commodities. What we seek to determine 
is the net resultant of all these factors. We seek a measure 
of the composite effect of the numerous forces that are 
causing individual prices to rise or fall. This measure will 
constitute an index number of wholesale prices. 

The unit with which we must deal is a single price varia- 
tion. Whether the statistical methods with which we are 
familiar may be employed in the organization and analysis 
of a number of such units depends upon the behavior of 
such units in mass. The following examples illustrate the 
frequency distributions secured when these data are clas- 
sified. 

Frequency Distributions of Price Ratios

Each price variation is, of course, a ratio, the ratio of the 
price of a commodity at a given date to the price of the 
commodity at another date. The ratios may be reduced 
to a comparable basis by putting them all in the form of 
relatives, of the type illustrated in the earlier examples of 
index numbers. Thus, using one of the pairs of price quotations
given above, the ratio of the price of steel scrap
in November, 1937, to the price in October, 1937, is
$12.500 : $14.688, which, in the form of a relative, becomes
85.1 : 100. In constructing the following frequency table, the
prices at wholesale in 1927 of 670 commodities were ex- 
pressed as relatives, with the 1926 price as a base in each 
case. The distribution of these 670 relative numbers is 
shown in Table 43. 

Table 43

Distribution of the Relative Prices of 670 Commodities in 1927 *
(Average prices in 1926 = 100)

    Relative price    Mid-point    No. of cases    Percentage of total
                          m                        number of cases

     52.5- 57.4           55             1                 .1
     57.5- 62.4           60             2                 .3
     62.5- 67.4           65             6                 .9
     67.5- 72.4           70             7                1.0
     72.5- 77.4           75             8                1.2
     77.5- 82.4           80            25                3.7
     82.5- 87.4           85            50                7.5
     87.5- 92.4           90            76               11.3
     92.5- 97.4           95           136               20.3
     97.5-102.4          100           196               29.3
    102.5-107.4          105            83               12.4
    107.5-112.4          110            26                3.9
    112.5-117.4          115            16                2.4
    117.5-122.4          120            14                2.1
    122.5-127.4          125            12                1.8
    127.5-132.4          130             2                 .3
    132.5-137.4          135             3                 .5
    137.5-142.4          140             5                 .8
    142.5-147.4          145             1                 .1
    147.5-152.4          150
    152.5-157.4          155             1                 .1

    Total                              670              100.0

The frequency polygon representing this distribution appears
in Fig. 49. For purposes of comparison with similar
distributions the figure shows the percentage distribution.

* The 670 commodities included were those employed by the U. S. Bureau
of Labor Statistics in the construction of its index of wholesale prices. The
figures, and the relatives, appear in Bulletin 473 of that Bureau, on
"Wholesale Prices, 1913-1927."
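
The percentage column of Table 43 follows from the class frequencies. The sketch below recomputes it and locates the modal class; the variable names are assumptions introduced here. Percentages rounded in this way may differ by a tenth of a point from the printed figures in a few classes, since the printed column appears to have been adjusted to total exactly 100.0.

```python
# Class mid-points and numbers of cases from Table 43 (1927 relatives, 1926 = 100).
cases_by_midpoint = {
    55: 1, 60: 2, 65: 6, 70: 7, 75: 8, 80: 25, 85: 50, 90: 76,
    95: 136, 100: 196, 105: 83, 110: 26, 115: 16, 120: 14, 125: 12,
    130: 2, 135: 3, 140: 5, 145: 1, 150: 0, 155: 1,
}

total_cases = sum(cases_by_midpoint.values())

# Percentage of all cases falling in each class.
percentages = {
    midpoint: 100 * cases / total_cases
    for midpoint, cases in cases_by_midpoint.items()
}

# The modal class is the one with the greatest number of cases.
modal_midpoint = max(cases_by_midpoint, key=cases_by_midpoint.get)

print(total_cases, modal_midpoint, round(percentages[modal_midpoint], 1))
```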
The correspondence of this frequency distribution to the
standard types portrayed in earlier sections is obvious.
There is the same marked concentration about a central
tendency, in this case a tendency of prices to remain stable,
for 29 per cent of all the cases showed a change not exceeding
2.5 per cent from their prices in the base year. There
is also, in this case, a fairly symmetrical distribution about
this central tendency, though the range above the mode is
slightly greater than the range below. Without at present
considering the question as to which average might best be
used to represent the central tendency in this distribution,
it is apparent that the use of some average is quite legitimate.

[Fig. 49. Frequency Polygon: Distribution of Relative Prices of
670 Commodities in 1927 (Average prices in 1926 = 100).
Horizontal axis: Relative Price, 50 to 150.]

The example just given has been based upon price variations
from one year to the next, over a period during which
the level of general prices declined slightly (4.6 per cent).
W. C. Mitchell gives a much more comprehensive illustration,
based upon the distribution of 5,578 price variations
from one year to the next over the period 1890-1913, which
shows the same general grouping. The excess of the range
above the mode over the range below is somewhat more
pronounced, in connection with which fact it should be
noted that prices were rising during most of the 23 years
covered. The distribution secured by Mitchell is shown in
Fig. 42.

The inertia of prices is most conspicuous when year-to-year
price changes are studied. It is therefore advisable to
consider the character of price variations over a longer
period, that we may learn whether the same type of distribution
is secured. Two examples are given, one of price
changes over a seven-year period, marked by a considerable
decline in prices, the other of price changes over a five-year
period characterized by rapidly rising prices. Table 44
shows the distribution of 774 price variations,
prices in 1933 being expressed as relatives on a 1926 base.
The general level of wholesale prices, it should be noted,
declined some 33 per cent from 1926 to 1933.

The data in Table 44 are plotted in the form of a frequency
polygon in Fig. 50, the percentage distribution being shown.
It will be noted that the distribution is curtailed, the five
upper classes being omitted.

[Fig. 50. Frequency Polygon: Distribution of Relative Prices of
774 Commodities in 1933 (Average prices in 1926 = 100).
Horizontal axis: Relative Price.]

Table 44

Distribution of Relative Prices of 774 Commodities in 1933
(Average prices in 1926 = 100)

    Relative prices    Mid-point    No. of cases    Percentage of total
                           m             f          number of cases

     10- 14.9             12.5            3                 .4
     15- 19.9             17.5
     20- 24.9             22.5            1                 .1
     25- 29.9             27.5            7                 .9
     30- 34.9             32.5           13                1.7
     35- 39.9             37.5           24                3.1
     40- 44.9             42.5           28                3.6
     45- 49.9             47.5           51                6.6
     50- 54.9             52.5           49                6.3
     55- 59.9             57.5           50                6.5
     60- 64.9             62.5           62                8.0
     65- 69.9             67.5           58                7.5
     70- 74.9             72.5           93               12.0
     75- 79.9             77.5           81               10.6
     80- 84.9             82.5           62                8.0
     85- 89.9             87.5           67                8.7
     90- 94.9             92.5           40                5.2
     95- 99.9             97.5           27                3.5
    100-104.9            102.5           27                3.5
    105-109.9            107.5           11                1.4
    110-114.9            112.5            6                 .8
    115-119.9            117.5            8                1.0
    120-124.9            122.5            1                 .1
    125-129.9            127.5            2                 .3
    155-159.9            157.5            1                 .1
    180-184.9            182.5            1                 .1
    190-194.9            192.5            1                 .1

    Total                               774              100.0

The distributions depicted in Figs. 49 and 50 differ materially.
The range of the variations is greater in the second
case, a condition naturally to be expected because of the
longer period covered. Secondly, a very much smaller percentage
of cases is concentrated in the modal group, though
there is still a pronounced central tendency. Both distributions,
as plotted on the arithmetic scale, are fairly symmetrical,
though a few extreme cases extend the actual upper limit
of the second distribution. In Fig. 49 the concentration about
the central tendency is much more marked, and the deviations
of individual price ratios from the central tendency
are smaller. This distribution resembles one which would
be secured from highly accurate physical measurements, or 
the distribution of shots from a very accurate piece of artil- 
lery. The second curve corresponds to one representing less 
accurate physical measurements, or to the distribution of 
shots from an old or inaccurate field piece. The modal
value occurs less frequently and the deviations from the 
central tendency are greater. It has been established that 
the longer the period covered in price comparisons such as 
those made above, the more pronounced is the tendency 
shown in the second curve. The value of the maximum 
ordinate falls and the range of the distribution increases. 
The curve becomes flatter and more extended as the time 
interval increases. And, quite obviously, as this process 
goes on the representative character of any type of average 
declines. Unless there is concentration about a central 
tendency an average is merely an abstraction, without con- 
crete significance. 

It is possible at this point to state as a tentative conclu- 
sion that price variations are capable of statistical measure- 
ment, that they may be represented appropriately by an 
average value, provided the period covered is not too long. 
No definite statement can be made as to the maximum 
period over which price variations may be measured. Index 
numbers having accurate and significant values must be 
based upon comparisons over relatively short periods, the 
most accurate being year-to-year comparisons. Index num- 
bers designed merely to show general trends in prices may 
cover longer periods, though the makers and users of such 
index numbers should realize their limitations.¹

As a final example we may note the distribution of the
relative prices of 1,437 commodities in 1918, average prices
during the period July, 1913 to June, 1914 serving as base.²

¹ Cf. W. C. Mitchell, "The Making and Using of Index Numbers,"
Bulletin 284 (Wholesale Price Series), U. S. Bureau of Labor Statistics.

² Data compiled by the Price Section of the War Industries Board;
reproduced in Part I, Bulletin 284, U. S. Bureau of Labor Statistics, 70.
This was a period marked by rapidly rising prices. In consulting
the graph (Fig. 51) it should be noted that the scales
are not the same as those employed in the two figures preceding.

[Fig. 51. Frequency Polygon: Distribution of Relative Prices of 1,437
Commodities in 1918 (Average prices July 1913 to June 1914 = 100).]

A study of this distribution bears out the conclusion
reached from the two examples preceding. There is a central
tendency sufficiently pronounced to be well represented by
an average. In this case, moreover, the modal group is
that with a mid-point of 180, so that the tendency toward
concentration cannot be attributed to inertia, but to the
presence of external forces affecting the price system as a
whole. There is, however, one marked point of difference
between this distribution and the two others. The tendency
toward skewness, which was in evidence in the first example,
is pronounced in this case. The curve, as plotted on the
arithmetic scale, is markedly asymmetrical. The greatest
concentration is near the lower limit of the scale and a long
tail, extending in fact far beyond the limit of the chart,
tapers out to the right. The highest relative price, indeed,
is 3,009, representing an increase of 2,909 points. The smallest
relative price, in comparison, is 36, representing a decline
of 64 points on the scale.

A price increase, expressed as a relative, has no upper 
limit. An increase of 100, 500, 1,000 per cent or more is 
conceivable and possible. The greatest price increase noted 
by the War Industries Board in its study of prices during 
the war was one of 4,981 per cent, in the case of acetphenetidin.
But 100 per cent is the maximum decline possible, as
that would mean that the price of a commodity had fallen 
to zero. This is the explanation of the skewness noted in 
the curves shown. When any considerable number of price 
ratios are tabulated the corresponding frequency curve, 
plotted on an arithmetic scale, shows this characteristic 
feature, a feature which is most conspicuous during a period 
of rising prices. 
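
The asymmetry described above can be illustrated numerically. On the arithmetic scale a doubling of price (relative 200) lies 100 points above the base while a halving (relative 50) lies only 50 points below it. The comparison on a logarithmic scale is added here purely for illustration and is an assumption not drawn from the passage; it shows that the two changes are equal and opposite when ratios rather than differences are compared. The function names are hypothetical.

```python
import math


def points_change(relative):
    """Change in points on the arithmetic scale, base = 100."""
    return relative - 100


def log_change(relative):
    """Common logarithm of the price ratio (relative / 100)."""
    return math.log10(relative / 100)


# Extreme relatives quoted in the text for 1918, plus a doubling and a halving.
for relative in (3009, 36, 200, 50):
    print(relative, points_change(relative), round(log_change(relative), 4))
```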

The argument developed in the preceding pages may be 
briefly summarized. Before discussing the practice of index 
number construction it was considered advisable to study 
the character of the raw materials and the nature of the 
distributions secured when these materials are brought to- 
gether, in order to determine whether ordinary statistical 
methods are appropriate. The raw materials, we have seen, 
consist of individual price variations, expressed as ratios. 
When a number of these ratios are assembled a frequency 
distribution is secured which somewhat resembles the dis- 
tribution of data following the normal law of error. A 
central tendency, which may legitimately be represented 
by an average, is apparent in the distribution of price varia- 
tions. The central tendency is less marked, however, and 
the deviations from it are more pronounced, the longer the 
period covered in the price comparison, so that an average 
becomes less representative as this period increases. In 
addition, a tendency toward skewness has been noted, and 
this was seen to be quite pronounced in a period of rising 
prices. This skewness is due to the fact that we are dealing
with ratios that have a definite lower limit and no upper 
limit. 

Variety of Methods Employed in Index Number
Construction

Many methods have been and are being employed in the 
construction of index numbers of wholesale prices. Usage 
varies for many reasons. There are differences of opinion 
as to which is theoretically the best method. There are 
practical difficulties to be surmounted, difficulties which 
inevitably cause differences in practice because of the vary- 
ing resources of the agencies engaged in these tasks. And 
there are, finally, differences due to the varying purposes 
for which index numbers are constructed, the varying ques- 
tions they are designed to answer. 

Prevailing differences in practice and differences in the 
results secured by the employment of various methods in 
the construction of index numbers can perhaps be illus- 
trated most effectively by the application of a number of 
methods to the same data. Table 45
presents the raw material to which these various methods are
to be applied: the average farm prices, on December 1, of
twelve leading crops, from 1919 to 1935.

EXPLANATION OF SYMBOLS 

The symbols to be employed in the computation of different
types of index numbers have the following meanings:

    $p_0'$  : price of a given commodity at time "0" (the base period).
    $q_0'$  : quantity of same commodity at time "0".
    $p_1'$  : price of same commodity at time "1".
    $q_1'$  : quantity of same commodity at time "1".
    $p_0''$ : price of a second commodity at time "0".
    $q_0''$ : quantity of second commodity at time "0".
    $p_1''$ : price of second commodity at time "1".
    $q_1''$ : quantity of second commodity at time "1".
    $p_1/p_0$ : a price relative (relation of price of a given commodity at
                time "1" to price of same commodity at time "0").
    $q_1/q_0$ : a quantity relative.
    $P_0$   : price level at time "0".
    $P_1$   : price level at time "1".

Simple Index Numbers of Prices 

In his exhaustive analysis of methods of index number
construction * Irving Fisher distinguishes six fundamental
types: the aggregative (or price aggregate), the arithmetic,
harmonic, geometric, median, and mode. The last has
never been employed in a practical way, and may be omitted.
The characteristics of the five remaining types may be
brought out by considering each of them in its simplest form,
before examining the more complicated combinations.

AGGREGATES OF ACTUAL PRICES 

In the construction of index numbers of the simple ag- 
gregative type, commodity prices pertaining to a given 
date are added; general price changes are measured by 
comparing the results thus secured for different dates. Using
the above symbols

    $$\frac{P_1}{P_0} = \frac{\sum p_1}{\sum p_0}$$

When such index numbers are constructed from the data of
Table 45 the results in Table 46 are secured.
The actual aggregates are given in column (2); to facilitate
comparison the same figures are reduced to relatives, with
the 1919 aggregate as base, in column (3).
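
A minimal sketch of the simple aggregative index, using the 1919 and 1920 farm prices that are reproduced in Table 47. The function name `aggregative_index` is an assumption introduced for illustration.

```python
# December 1 farm prices for the twelve crops, in the order of Table 47:
# corn, cotton, hay, wheat, oats, potatoes, sugar, barley, tobacco,
# flaxseed, rye, rice.
prices_1919 = [1.343, 0.356, 20.150, 2.131, 0.702, 1.580,
               0.102, 1.215, 0.390, 4.383, 1.331, 2.666]
prices_1920 = [0.656, 0.139, 17.780, 1.433, 0.456, 1.128,
               0.053, 0.716, 0.212, 1.770, 1.256, 1.191]


def aggregative_index(base_prices, given_prices):
    """Simple aggregate of actual prices, as a relative: 100 * sum(p1) / sum(p0)."""
    return 100 * sum(given_prices) / sum(base_prices)


aggregate_1919 = sum(prices_1919)
aggregate_1920 = sum(prices_1920)
index_1920 = aggregative_index(prices_1919, prices_1920)

print(round(aggregate_1919, 3), round(aggregate_1920, 3), round(index_1920))
```

The two aggregates agree with the first two entries of column (2) of Table 46, and the relative for 1920 rounds to the 74 shown in column (3).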

The results secured by this method of constructing index 
numbers of prices will be compared shortly with results 
secured from the same data by other methods. The chief 
weakness of this type of index number is obvious. This is 
not an unweighted nor yet an equally weighted index. 
The influence of each commodity upon the result is depend- 
ent upon the price of the unit in which it happens to be 

* The Making of Index Numbers, Houghton Mifflin Co., 1922.
Table 46

Index Numbers of Farm Crop Prices
(Aggregates of actual prices)

    (1)          (2)                  (3)
    Year         Index                Index, relative
                 (aggregate of        (1919 = 100)
                 actual prices)

    1919         $36.349              100
    1920          26.790               74
    1921          18.690               51
    1922          19.913               55
    1923          21.838               60
    1924          23.142               64
    1925          23.831               66
    1926          22.499               62
    1927          19.291               53
    1928          19.584               54
    1929          21.339               59
    1930          18.290               50
    1931          13.211               36
    1932           9.503               26
    1933          13.691               38
    1934          20.723               57
    1935          12.844               35

quoted. Hay, [quoted by the ton, outweighs all the other] commodities
combined, with flaxseed second in importance. The index
secured by adding the quotations is weighted in an entirely
illogical fashion and cannot [be accepted] as reflecting the
course of farm crop prices.

[A method sometimes proposed] for avoiding the [difficulty is to
reduce all quotations to a common unit, pricing hay,] cotton,
and the other commodities by the pound, and then [adding these
prices to secure] the index. Yet this method, which has been
employed in the construction of Bradstreet's index, replaces
one system of illogical weighting by another. Equal
weight, if such be desired, is not given to all commodities
by this method. Thus, in 1919 hay was worth $.010075
per pound, cotton $.356 per pound and rice $.059 per
pound, cotton having a weight in an aggregate of per pound
prices 6 times that of rice and 35 times that of hay.

ARITHMETIC AVERAGES OF RELATIVE PRICES

Another method employed in the construction of index 
numbers involves the reduction of each quoted price to a 
relative, with reference to the price of the same commodity 
at a certain basic date, these relative figures then being 
averaged by any of the conventional methods. The example 
in Table 47 illustrates the first phase of this process, data 
for two years being utilized. The year 1919 is taken as base. 

Table 47

Computation of Relative Prices for the Construction of Index Numbers

    (1)             (2)          (3)           (4)        (5)           (6)
    Commodity       Unit         Price, 1919   Relative   Price, 1920   Relative

    Corn            Bu.          $ 1.343         100      $  .656         48.8
    Cotton          Lb.             .356         100         .139         39.0
    Hay             Ton (sh.)     20.150         100       17.780         88.2
    Wheat           Bu.            2.131         100        1.433         67.2
    Oats            Bu.             .702         100         .456         65.0
    Wh. Potatoes    Bu.            1.580         100        1.128         71.4
    Sugar           Lb.             .102         100         .053         52.0
    Barley          Bu.            1.215         100         .716         58.9
    Tobacco         Lb.             .390         100         .212         54.4
    Flaxseed        Bu.            4.383         100        1.770         40.4
    Rye             Bu.            1.331         100        1.256         94.4
    Rice            Bu.            2.666         100        1.191         44.7

    Total                                      1,200                     724.4


From these figures the arithmetic averages of relative
prices in these two years may be readily computed. The
formula for any single relative is $p_1/p_0$. When there are N
relatives the formula for the index number at time "1" is

    $$\frac{P_1}{P_0} = \frac{\sum \left( \frac{p_1}{p_0} \right)}{N}$$

In the present case

    Index (1919) = 1,200 / 12 = 100.

    Index (1920) = 724.4 / 12 = 60.4.

Index numbers computed in this way for the years 1919 to
1935, inclusive, are shown in column (3) of Table 50.
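
The two steps of this method, forming the relatives and averaging them, can be sketched as follows with the prices of Table 47. The function names are assumptions introduced here; relatives are rounded to one decimal place to match the table.

```python
# Prices from Table 47, in the table's commodity order.
prices_1919 = [1.343, 0.356, 20.150, 2.131, 0.702, 1.580,
               0.102, 1.215, 0.390, 4.383, 1.331, 2.666]
prices_1920 = [0.656, 0.139, 17.780, 1.433, 0.456, 1.128,
               0.053, 0.716, 0.212, 1.770, 1.256, 1.191]


def price_relatives(base_prices, given_prices):
    """Relative for each commodity: 100 * p1 / p0, to one decimal place."""
    return [round(100 * p1 / p0, 1) for p0, p1 in zip(base_prices, given_prices)]


def mean_of_relatives(relatives):
    """Arithmetic average of the relatives: sum(p1 / p0) / N, in percentage form."""
    return sum(relatives) / len(relatives)


relatives_1920 = price_relatives(prices_1919, prices_1920)
index_1920 = mean_of_relatives(relatives_1920)

print(relatives_1920)
print(round(sum(relatives_1920), 1), round(index_1920, 1))
```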

This type of index number is usually termed an "unweighted"
index of relative prices. It is weighted, however,
just as are the types illustrated in the two examples pre- 
ceding. The quantity employed as weight in each case is 
the amount of each commodity which would sell for $100 
in the base year. In the preceding example the following 
quantities have been employed as weights : 


    Corn          74.5 bu.
    Cotton       280.9 lbs.
    Hay            4.96 tons
    Wheat         46.9 bu.
    Oats         142.5 bu.
    Potatoes      63.3 bu.
    Sugar        980.4 lbs.
    Barley        82.3 bu.
    Tobacco      256.4 lbs.
    Flaxseed      22.8 bu.
    Rye           75.1 bu.
    Rice          37.5 bu.


What has been done, in effect, in the computation of the
simple average of relative prices has been to determine the
aggregate amount for which the above quantities would sell
in each of the years included. At 1919 prices each
of the above quantities would sell for $100, the aggregate
value being $1,200; at 1920 prices the aggregate value of
the above quantities was $724.40. These aggregates, divided
by 12, give the index numbers shown in column (3),
Table 50: 100 for 1919, 60 (60.4) for 1920, etc. Thus the
“unweighted average of relative prices” is in fact a weighted
aggregate of actual prices. It is equally weighted in the
sense that the value of the quantity of each commodity
employed as weight was equal to $100 in the base year, 1919.¹
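
The equivalence just described can be illustrated for the two crops whose actual prices appear in the table above (rye and rice). This is a minimal sketch restricted to those two rows; the variable names are illustrative assumptions.

```python
# Prices per bushel of rye and rice, taken from the table above.
prices_1919 = {"rye": 1.331, "rice": 2.666}
prices_1920 = {"rye": 1.256, "rice": 1.191}

# Implicit weight: the quantity of each crop that sold for $100 in 1919.
quantities = {crop: 100.0 / price for crop, price in prices_1919.items()}

# Route 1: simple average of the price relatives (percentage form).
relatives = [100.0 * prices_1920[c] / prices_1919[c] for c in prices_1919]
mean_of_relatives = sum(relatives) / len(relatives)

# Route 2: value of those fixed quantities at 1920 prices, divided by N.
aggregate_1920 = sum(quantities[c] * prices_1920[c] for c in prices_1919)
aggregate_index = aggregate_1920 / len(prices_1919)

print(round(mean_of_relatives, 4), round(aggregate_index, 4))
```

Both routes give the same figure because each relative, $100 \times p_1/p_0$, is exactly the 1920 value of the quantity $100/p_0$.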

MEDIANS OF RELATIVE PRICES

The median rather than the arithmetic mean may be
employed in securing the average of the relative prices for
each year. When the relatives in column (6) of Table 47
are arranged in order of magnitude the following distribution
is secured:


39.0     58.9
40.4     65.0
44.7     67.2
48.8     71.4
52.0     88.2
54.4     94.4


The smallest relative price is 39.0, the greatest 94.4;
the median value, midway between the sixth and seventh
items (54.4 and 58.9), is 56.65. This median value is the index
number for 1920. All the index numbers computed in this
way from the medians of relative prices are presented in
column (4), Table 50.
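
A brief sketch of the same step, using the standard-library `statistics.median`; with an even number of items it averages the two middle values, which matches the procedure described above.

```python
import statistics

# 1920 price relatives (1919 = 100) for the twelve crops, as quoted in the text.
RELATIVES_1920 = [48.8, 39.0, 88.2, 67.2, 65.0, 71.4,
                  52.0, 58.9, 54.4, 40.4, 94.4, 44.7]

ordered = sorted(RELATIVES_1920)
middle_pair = (ordered[5], ordered[6])   # sixth and seventh of twelve items
median_index = statistics.median(RELATIVES_1920)

print(ordered)
print(middle_pair, round(median_index, 2))
```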


GEOMETRIC AVERAGES OF RELATIVE PRICES


The geometric averages of the relative prices for the
various years may now be computed and the results compared
with those secured in the preceding examples. A
single relative being represented by the symbol $p_1/p_0$, the
formula for the geometric mean of $N$ relatives is

$$M_g = \sqrt[N]{\frac{p_1'}{p_0'} \times \frac{p_1''}{p_0''} \times \frac{p_1'''}{p_0'''} \times \cdots}$$

A geometric mean is generally computed by the aid of
logarithms; in this case

$$\log M_g = \frac{\log \frac{p_1'}{p_0'} + \log \frac{p_1''}{p_0''} + \log \frac{p_1'''}{p_0'''} + \cdots}{N}$$


¹ Attention was called to this characteristic of the simple average of relative
prices by F. R. Macaulay, American Economic Review, Dec., 1915, 928.

The method of computation may be illustrated for the
years 1919 and 1920. The relative prices of the various
commodities are repeated from Table 47.

Table 48

Computation of Geometric Averages of Relative Prices

(1)             (2)               (3)                (4)               (5)
Commodity       Relative price,   Logarithm of       Relative price,   Logarithm of
                1919              fig. in col. (2)   1920              fig. in col. (4)

Corn            100               2.0                48.8              1.68842
Cotton          100               2.0                39.0              1.59106
Hay             100               2.0                88.2              1.94547
Wheat           100               2.0                67.2              1.82737
Oats            100               2.0                65.0              1.81291
Wh. Potatoes    100               2.0                71.4              1.85370
Sugar           100               2.0                52.0              1.71600
Barley          100               2.0                58.9              1.77012
Tobacco         100               2.0                54.4              1.73560
Flaxseed        100               2.0                40.4              1.60638
Rye             100               2.0                94.4              1.97497
Rice            100               2.0                44.7              1.65031

Total                             24.0                                 21.17231

Log Mg (1919) = 24.0 / 12 = 2
Mg = anti-logarithm of 2 = 100

Log Mg (1920) = 21.17231 / 12 = 1.76436
Mg = anti-logarithm of 1.76436 = 58.1.

This value, 58.1, is the index number for 1920. The 
results for all the years are summarized in column (5), 
Table 50. 
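
The logarithmic procedure of Table 48 can be sketched as follows. The code uses common (base 10) logarithms, as the table does; full-precision logarithms are used here, so the sum agrees with the five-place figures of the table only to about four decimals.

```python
import math

# 1920 price relatives (1919 = 100), in the order of Table 48.
RELATIVES_1920 = [48.8, 39.0, 88.2, 67.2, 65.0, 71.4,
                  52.0, 58.9, 54.4, 40.4, 94.4, 44.7]

# Column (5) of Table 48: common logarithm of each relative.
logs = [math.log10(r) for r in RELATIVES_1920]

# Log Mg is the arithmetic mean of the logarithms.
log_mg = sum(logs) / len(logs)

# Mg is the anti-logarithm of that mean.
mg = 10 ** log_mg

print(round(sum(logs), 5), round(log_mg, 5), round(mg, 1))
```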

HARMONIC AVERAGES OF RELATIVE PRICES

The characteristics of the harmonic average have been
discussed in a preceding chapter. The reciprocal of the
harmonic mean, it will be recalled, is the arithmetic mean
of the reciprocals of the constituent measures. The constituent
items, in the present case, are price relatives of the
form $p_1/p_0$. The reciprocal of such a relative is $p_0/p_1$. The
formula for the harmonic mean of $N$ price relatives is,
therefore,

$$\frac{1}{H} = \frac{\frac{p_0'}{p_1'} + \frac{p_0''}{p_1''} + \frac{p_0'''}{p_1'''} + \cdots}{N}$$

or

$$H = \frac{N}{\frac{p_0'}{p_1'} + \frac{p_0''}{p_1''} + \frac{p_0'''}{p_1'''} + \cdots}$$

The method of computation is illustrated in Table 49. 


Table 49

Computation of Harmonic Averages of Relative Prices

(1)             (2)               (3)                (4)               (5)
Commodity       Relative price,   Reciprocal of      Relative price,   Reciprocal of
                1919              fig. in col. (2)   1920              fig. in col. (4)

Corn            100               .01                48.8              .02049180
Cotton          100               .01                39.0              .02564103
Hay             100               .01                88.2              .01133787
Wheat           100               .01                67.2              .01488095
Oats            100               .01                65.0              .01538462
Wh. Potatoes    100               .01                71.4              .01400560
Sugar           100               .01                52.0              .01923077
Barley          100               .01                58.9              .01697793
Tobacco         100               .01                54.4              .01838235
Flaxseed        100               .01                40.4              .02475248
Rye             100               .01                94.4              .01059322
Rice            100               .01                44.7              .02237136

Total                             .12                                  .21404998

H (1919) = 12 / .12 = 100.

H (1920) = 12 / .21404998 = 56.1.
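
The computation in Table 49 can be reproduced with a few lines; this is a sketch only, with illustrative variable names.

```python
# 1920 price relatives (1919 = 100), in the order of Table 49.
RELATIVES_1920 = [48.8, 39.0, 88.2, 67.2, 65.0, 71.4,
                  52.0, 58.9, 54.4, 40.4, 94.4, 44.7]

# Column (5) of Table 49: reciprocal of each relative.
reciprocals = [1.0 / r for r in RELATIVES_1920]

# The harmonic mean is N divided by the sum of the reciprocals.
harmonic_1920 = len(RELATIVES_1920) / sum(reciprocals)

# In the base year every reciprocal is .01, so H = 12 / .12 = 100.
harmonic_1919 = 12 / (12 * 0.01)

print(round(sum(reciprocals), 8), round(harmonic_1920, 4))
```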



The index numbers computed in this way for all the years
included in the study are shown in column (6), Table 50.

In the construction of the five types of index numbers
explained above no attempt has been made to use a logical
weighting system. All are termed “unweighted” averages,
a term which is quite misleading. The first index constructed,
based on aggregates of actual prices, is a heavily
weighted index number, though the weights are illogical.
In the next four the quantities employed as weights are the
amounts purchasable for $100 in 1919. The five results
are brought together and compared in Table 50. In each case
the index is given to the nearest whole number. These index
numbers are plotted in Fig. 52.

Comparison of Simple Index Numbers 

The four averages of relative prices agree much more
closely with each other than with the index numbers based
on aggregates. For reasons already suggested the latter is
quite untrustworthy as a measure of price changes. Of the
other index numbers, the arithmetic, geometric, and harmonic
means show a consistent relationship, a fact which
follows from the nature of the averages employed. Except
in the base year the geometric mean is always less than
the arithmetic and the harmonic is always less than the
geometric, the amount of difference increasing as the dispersion
of prices becomes greater. The median, with only
twelve items to be averaged, is somewhat unstable, and its
relationship to the other averages is not always a consistent
one.

How are we to choose among these varying results? No
one of these “unweighted” index numbers is perfect, for
weights which have crept in do not measure the relative
importance of the various commodities included in the
index numbers. But, neglecting for the moment the question
of weights, is it possible to test the adequacy of the different
methods of measuring changes in the prices as given?

Table 50

Index Numbers of Farm Crop Prices, 1919-1935

(1919 = 100)

(1)     (2)               (3)            (4)           (5)            (6)
Year    Aggregates of     Arithmetic     Medians of    Geometric      Harmonic
        actual prices     averages of    relative      averages of    averages of
        (as relatives)    relative       prices        relative       relative
                          prices                       prices         prices

1919    100               100            100           100            100
1920     74                60             57            58             56
1921     51                44             42            43             42
1922     55                51             50            50             49
1923     60                55             50            54             53
1924     64                60             61            59             58
1925     66                59             53            57             55
1926     62                53             49            52             50
1927     53                53             55            52             52
1928     54                48             48            47             46
1929     59                54             53            53             52
1930     50                38             32            36             35
1931     36                27             27            27             26
1932     26                20             18            19             19
1933     38                35             33            34             34
1934     57                48             48            46             43
1935     35                35             36            35             34

Fig. 52. Comparison of Five Simple Index Numbers of Farm Crop
Prices, 1919-1935 (1919 = 100)

THE TIME REVERSAL TEST

For this purpose Irving Fisher has employed what he
terms the “time reversal test.” This is merely a test to
determine whether a given method will work both ways in
time, forward and backward. If from 1935 to 1936 sugar
should increase from four to eight cents a pound, the price
in 1936 would be 200 per cent of the price in 1935, and the
price in 1935 would be 50 per cent of the price in 1936.
One figure is the reciprocal of the other; their product
(2.00 × .50) is unity. Similarly, if a given method of index
number construction shows the general price level in one
year to be 200 per cent of the level in the preceding year, it
should work correctly when reversed; it should show that
the price level in the first year was 50 per cent of the price
level in the second year. When the data for any two years
are treated by the same method, but with the bases reversed,
the two index numbers secured should be reciprocals of each
other. Their product should always be unity. If it is not,
there is an inherent bias in the method.

This test may be applied to the methods employed above,
using prices for 1919 and 1920. With 1919 as base the
following results were obtained:

Year    Aggregates of     Arithmetic     Medians of    Geometric      Harmonic
        actual prices     averages of    relative      averages of    averages of
        (as relatives)    relative       prices        relative       relative
                          prices                       prices         prices

1919    100               100            100           100            100
1920     73.70216          60.36666       56.65         58.1221        56.0617

and with 1920 as base:

Year    Aggregates of     Arithmetic     Medians of    Geometric      Harmonic
        actual prices     averages of    relative      averages of    averages of
        (as relatives)    relative       prices        relative       relative
                          prices                       prices         prices

1919    135.68122         178.36666      176.85        172.04         165.6467
1920    100               100            100           100            100

When the index numbers for 1920 in the first table are
multiplied by the corresponding index numbers for 1919
in the second table, we have the following values. (In
securing these products the index numbers are put in the
ratio, not in the percentage form.)

Aggregates of     Arithmetic     Medians of    Geometric      Harmonic
actual prices     averages of    relative      averages of    averages of
                  relative       prices        relative       relative
                  prices                       prices         prices

1.00              1.0767         1.00          1.00           .9286


This time reversal test is met by three of the methods
employed. It is not met by either the arithmetic or harmonic
averages. The former has a distinct upward bias,
amounting to more than seven per cent when the errors for
1919 and 1920 are compounded, while the harmonic mean
shows almost as large an error in the opposite direction.
Unless the inherent bias which is found in both these averages
is rectified in some way, methods based upon these
averages should not be used in the construction of index
numbers.
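
The time reversal test for the three means of relatives can be sketched as below. The backward relatives are taken simply as reciprocals of the forward ones, and the averaging functions come from Python's standard `statistics` module (Python 3.8 or later is assumed); the last digits may differ slightly from the printed figures because of rounding in the hand computation. The median is left out of the sketch.

```python
import statistics

# 1920 price relatives (1919 = 100), as quoted in the text.
RELATIVES_1920 = [48.8, 39.0, 88.2, 67.2, 65.0, 71.4,
                  52.0, 58.9, 54.4, 40.4, 94.4, 44.7]

# Ratio form: 1920 on the 1919 base, and 1919 on the 1920 base.
forward = [r / 100.0 for r in RELATIVES_1920]
backward = [1.0 / f for f in forward]


def time_reversal_product(average):
    """Forward index times backward index; unity means the test is met."""
    return average(forward) * average(backward)


products = {
    "arithmetic": time_reversal_product(statistics.fmean),
    "geometric": time_reversal_product(statistics.geometric_mean),
    "harmonic": time_reversal_product(statistics.harmonic_mean),
}

for name, value in products.items():
    print(f"{name:10s} {value:.4f}")
```

The arithmetic product comes out above unity and the harmonic product below it, while the geometric product is unity to within floating-point error.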

The Weighting of Index Numbers 

Five simple index numbers of prices have been described
in the preceding section. With the introduction of weighting
the number of possible combinations is greatly increased,
but only a few of these types need concern us here.

In the construction of an accurate measure of price changes 
logical weights must be employed, weights which truly reflect 
the relative importance of the commodities included. If the 
weighting problem is ignored haphazard and illogical weights 
will inevitably be present, whether recognized or not. 

The data used in the preceding examples may be utilized
to illustrate methods of weighting and to show the effects
of varying weights upon the values of index numbers. The
weights employed in constructing index numbers of farm
crop prices may be either the quantities or values of the
crops produced, depending upon the type of index selected.
The quantities produced during the period 1919-1935 are
given in Table 51.

WEIGHTED AGGREGATES OF ACTUAL PRICES 

The thoroughly illogical results obtained when actual
prices, as quoted, are totaled to secure an index number
have been pointed out. The same objection cannot be
made when the prices are appropriately weighted before
the aggregate is taken. If for weights we employ the quantities
produced in the base year (at time “0”) the formula
for the weighted aggregate is

$$\frac{\sum p_1 q_0}{\sum p_0 q_0}$$

This is, in effect, the method employed by the United States
Bureau of Labor Statistics, though the quantities are taken
from a year other than the base year. The formula for this
type of weighted aggregative index is known as Laspeyres’
formula. The method is illustrated in Table 52.

The desired index numbers, in the form of relatives, may 
be computed from the aggregates secured by totaling col- 
umns (5) and (8) of Table 52. Either year may be taken 
as the base, and the price aggregate in the other year ex- 
pressed as a relative on this base. With the 1919 aggregate 
as base the index for 1920 is 58.2. Index numbers similarly 
computed for the other years are given in column (2), 
Table 55. 

Another type of weighted aggregate may be constructed,
with weights taken not from the base period but from the
later period in the given comparison. That is, we may
employ $q_1$ (quantity at time “1”) as weight in comparing
prices at time “1” with prices at time “0,” and employ $q_2$
(quantity at time “2”) as weight in comparing prices at
time “2” with prices at time “0.” Algebraically, the
formula for the index number at time “1” is

$$\frac{\sum p_1 q_1}{\sum p_0 q_1}$$

This is known as Paasche’s formula. The process of computation
is precisely the same as in the preceding example,
except that the weights are changed with each successive
year. The index numbers secured by this method are given
in column (3), Table 55.

The weights in these two cases have been quantities, for
prices, multiplied by quantities, give aggregates in dollar
values. But in weighting individual price relatives, quantities
will not serve. The abstract relatives must be weighted
by values, if the resulting products are to be comparable.
For values are in terms of a common dollar unit, while
quantities may be expressed in a variety of units. The values
which are to be employed as weights may be derived in
various ways.

Fisher¹ outlines the four following methods, of which the
second and third are hybrid types:

I. Each weight = base year price × base year quantity ($p_0 q_0$).

II. Each weight = base year price × given year quantity ($p_0 q_1$).

III. Each weight = given year price × base year quantity ($p_1 q_0$).

IV. Each weight = given year price × given year quantity ($p_1 q_1$).

Just as certain averages possess inherent bias, so a distinctive
weight bias arises from each type of value weighting.
(This inherent bias is absent from the quantity weighting.)
A downward bias arises from weighting systems I and II
(in which base year prices are used), while an upward bias
arises from weighting systems III and IV (using prices in
the given year). This is in part capable of mathematical
demonstration² and has in part been established by numerous
trials.

¹ Irving Fisher, The Making of Index Numbers, 54.

² An index weighted by type III must exceed an index weighted by type I.
(Footnote 2 is continued below.)

In the several examples next following we shall deal only
with values of quantities produced in the base year, 1919.
These values are given in the third column of Table 53. For
weighting purposes they are taken to the nearest million.

WEIGHTED ARITHMETIC AVERAGES OF 
RELATIVE PRICES 

In the computation of an index of this type, each relative 
is multiplied by the appropriate weight and the sum of the 
products is divided by the sum of the weights. The process 
is illustrated in Table 53. 

The index for 1920, it will be noted, is identical with that 
secured from the computations illustrated in Table 52. That 
index is a weighted aggregate of actual prices, the weights 
being the quantities produced in the base year. An arith- 
metic mean of relative prices, weighted by values in the base 
year, is always equal to a relative constructed from such an 
aggregate.¹
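
The weighted arithmetic mean for 1920 can be checked with the following sketch. The relatives and the weights (1919 values in millions of dollars) are those listed in Table 53; the variable names are illustrative.

```python
# 1920 price relatives (1919 = 100), in the order of Table 53.
RELATIVES_1920 = [48.8, 39.0, 88.2, 67.2, 65.0, 71.4,
                  52.0, 58.9, 54.4, 40.4, 94.4, 44.7]

# Weights: values of the quantities produced in 1919, in millions of dollars.
WEIGHTS = [3598, 2031, 1543, 2029, 777, 470,
           446, 159, 563, 30, 105, 114]

# Each relative is multiplied by its weight; the sum of the products
# is divided by the sum of the weights.
weighted_sum = sum(r * w for r, w in zip(RELATIVES_1920, WEIGHTS))
weighted_mean_1920 = weighted_sum / sum(WEIGHTS)

print(sum(WEIGHTS), round(weighted_sum, 1), round(weighted_mean_1920, 1))
```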

(Footnote 2, continued.)

Weighting the price relative of a given commodity by type III, we have

$$\frac{p_1}{p_0} \times p_1 q_0$$

while by type I we have

$$\frac{p_1}{p_0} \times p_0 q_0.$$

If $p_1$ exceeds $p_0$ (if the price relative is above 100) the weight by type III
($p_1 q_0$) is greater than the weight by type I ($p_0 q_0$). That is, all relatives above
100 are more heavily weighted by type III than by type I. But if $p_1$ is less than
$p_0$ the weight by type III ($p_1 q_0$) is less than the weight by type I ($p_0 q_0$). All
relatives below 100 are less heavily weighted by type III than by type I. Thus
the effect of all price increases is over-emphasized and the effect of all price
declines is under-emphasized by type III, giving a net result always greater
than type I. The same is true of type IV as compared with type II. As between
types I and IV there is no necessary relation, but in general an index
weighted by type IV will exceed an index weighted by type I. Base year
weighting involves a downward bias while given year weighting involves an
upward bias. (For a more detailed discussion of bias in weighting see Fisher,
The Making of Index Numbers, Chapter V and pages 384-387.)

Table 53

Computation of Weighted Arithmetic Averages of Relative Prices

Commodity    Relative   Weight     Relative      Relative   Weight     Relative
             price,                price         price,                price
             1919                  × weight      1920                  × weight

Corn         100        $3,598     $359,800      48.8       $3,598     $175,582.4
Cotton       100         2,031      203,100      39.0        2,031       79,209.0
Hay          100         1,543      154,300      88.2        1,543      136,092.6
Wheat        100         2,029      202,900      67.2        2,029      136,348.8
Oats         100           777       77,700      65.0          777       50,505.0
Potatoes     100           470       47,000      71.4          470       33,558.0
Sugar        100           446       44,600      52.0          446       23,192.0
Barley       100           159       15,900      58.9          159        9,365.1
Tobacco      100           563       56,300      54.4          563       30,627.2
Flaxseed     100            30        3,000      40.4           30        1,212.0
Rye          100           105       10,500      94.4          105        9,912.0
Rice         100           114       11,400      44.7          114        5,095.8

Total                  $11,865   $1,186,500                $11,865     $690,699.9

(The weights employed are the values of the quantities produced in
1919, in millions.)

Weighted arithmetic mean (1919) = $1,186,500 / $11,865 = 100

Weighted arithmetic mean (1920) = $690,699.9 / $11,865 = 58.2

¹ This may be readily demonstrated algebraically. The value of any commodity
in the base year is $p_0 q_0$, while the price relative for a second year is
$p_1/p_0$. The weighted mean of such price relatives is equal to

$$\frac{\frac{p_1'}{p_0'} \times p_0' q_0' + \frac{p_1''}{p_0''} \times p_0'' q_0'' + \frac{p_1'''}{p_0'''} \times p_0''' q_0''' + \cdots}{p_0' q_0' + p_0'' q_0'' + p_0''' q_0''' + \cdots}$$

which reduces to

$$\frac{\sum p_1 q_0}{\sum p_0 q_0},$$

a weighted aggregate of the type mentioned.

In the same way the harmonic mean, weighted by full values in the second
year, reduces to

$$\frac{\sum p_1 q_1}{\sum p_0 q_1}.$$

This has already been encountered as an aggregate of actual prices weighted
by quantities in the second year.

WEIGHTED GEOMETRIC AVERAGES OF
RELATIVE PRICES

The process of computing the weighted geometric mean
is identical with that of computing the unweighted geometric
mean, except that the logarithm of each relative is multiplied
by the given weight and the sum of these weighted
logarithms is divided by the sum of the weights, the result
being the logarithm of the desired index.¹ The method is
illustrated in Table 54.

Table 54

Computation of Weighted Geometric Average of Relative Prices, 1920

(1919 = 100)

Commodity        Relative       Logarithm of      Weight     Logarithm of
                 price, 1920    relative price               relative price
                                                             × weight

Corn             48.8           1.68842           3,598      6074.93516
Cotton           39.0           1.59106           2,031      3231.44286
Hay              88.2           1.94547           1,543      3001.86021
Wheat            67.2           1.82737           2,029      3707.73373
Oats             65.0           1.81291             777      1408.63107
Potatoes, Wh.    71.4           1.85370             470       871.23900
Sugar            52.0           1.71600             446       765.33600
Barley           58.9           1.77012             159       281.44908
Tobacco          54.4           1.73560             563       977.14280
Flaxseed         40.4           1.60638              30        48.19140
Rye              94.4           1.97497             105       207.37185
Rice             44.7           1.65031             114       188.13534

Total                                            11,865     20,763.46850

Log Mg = 20,763.46850 / 11,865 = 1.74998

Mg = 56.2


The index for 1920 on the 1919 base is 56.2. Measure- 
ments secured for all the years of the period covered are 
given in column (5), Table 55, together with the other 
weighted index numbers already explained. 
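
The weighted geometric computation of Table 54 can be sketched in the same way. Full-precision common logarithms are used, so the intermediate figures agree with the five-place table only approximately.

```python
import math

# 1920 price relatives (1919 = 100) and 1919 value weights (millions of
# dollars), in the order of Table 54.
RELATIVES_1920 = [48.8, 39.0, 88.2, 67.2, 65.0, 71.4,
                  52.0, 58.9, 54.4, 40.4, 94.4, 44.7]
WEIGHTS = [3598, 2031, 1543, 2029, 777, 470,
           446, 159, 563, 30, 105, 114]

# Multiply the logarithm of each relative by its weight and total the products.
weighted_log_sum = sum(w * math.log10(r)
                       for r, w in zip(RELATIVES_1920, WEIGHTS))

# Divide by the sum of the weights to obtain the logarithm of the index.
log_index = weighted_log_sum / sum(WEIGHTS)
weighted_geometric_1920 = 10 ** log_index

print(round(log_index, 5), round(weighted_geometric_1920, 1))
```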

¹ The formula for the weighted geometric mean is given in Chapter IV.

How are we to judge of the relative merits of these three
index numbers? We may, first, apply the time reversal
test which was employed in comparing the five simple index
numbers. This test is not met by any of the weighted types
we have constructed. The geometric is equally at fault
with the others. Though the simple geometric meets the
test, the introduction of weighting imparts a bias to the
result. Judged by that test alone none of the three is satisfactory.
We may next try the second fundamental test
that Fisher has developed, which is termed the “factor
reversal test.”


THE FACTOR REVERSAL TEST 


The total value of a given commodity in a given year is,
of course, the product of the quantity produced and the
price per unit; algebraically, it is equal to $p_1 q_1$. The ratio
of the total value in one year to the total value in the preceding
year is $\frac{p_1 q_1}{p_0 q_0}$. If, from one year to the next, both price
and quantity should double, the price relative would be 200,
the quantity relative 200, and the value relative 400. The
total value in the second year would be four times the value
in the first year. The value relative would be equal to the
product of the price and quantity relatives, a relationship
which is obvious in the case of a single commodity.

If, for a number of commodities, we construct an index 
of the price change from one year to the next and an index 
of the quantity change from one year to the next, we should 
expect their product to be equal to the ratio of the total 
values in the second year to the total values in the first 
year. If the product is not equal to the value ratio, there 
is an error in one or both of the index numbers. 

As an illustration, we may apply this test to the first
aggregative index constructed, $\frac{\sum p_1 q_0}{\sum p_0 q_0}$. An index of quantities
may be computed from this same formula, merely
interchanging the q’s and the p’s; the formula becomes

$$\frac{\sum q_1 p_0}{\sum q_0 p_0}$$

The same price factor appears in numerator and denominator,
as we desire to measure only the effect of the quantity
change. Substituting the given values of the twelve
farm crops we have

Quantity index, 1920 (1919 = 100) = $\frac{\sum q_1 p_0}{\sum q_0 p_0}$ = 1.0956.

In percentage form the index of quantities produced in
1920 is 109.56, with 1919 as base. The corresponding price
index, by the same formula, is 58.24. The product

1.0956 × .5824 = .6381.

That is, if prices have decreased 41.76 per cent, while 
quantities have increased 9.56 per cent, the total value 
should show a decrease of 36.19 per cent. 

For the value ratio we have

$$\frac{\sum p_1 q_1}{\sum p_0 q_0} = \frac{\$7,441,317,450}{\$11,864,461,250} = .6272.$$

There is a discrepancy here of about one per cent. The 
actual error is not great, but the formula definitely fails to 
meet the factor reversal test, and cannot be accepted as 
satisfactory. 

When this test is applied to the second aggregative index
we secure the following values for 1920, with respect to
1919 as base:

Price index = $\frac{\sum p_1 q_1}{\sum p_0 q_1}$ = 57.25

Quantity index = $\frac{\sum q_1 p_1}{\sum q_0 p_1}$ = 107.69

Product = .5725 × 1.0769 = .6165

(In securing the product the index numbers are put in
the ratio, not in the percentage form.)

Here is an error of the same magnitude in the other direction.
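
The two comparisons can be restated numerically. The sketch below uses only figures quoted in the text (the two aggregate values and the four published index numbers in ratio form); it does not recompute the indexes from the underlying prices and quantities, which are not reproduced here.

```python
# Aggregate values quoted in the text: sum(p1*q1) and sum(p0*q0).
value_1920 = 7_441_317_450
value_1919 = 11_864_461_250
value_ratio = value_1920 / value_1919

# Published price and quantity indexes for 1920 (1919 = 1.00), ratio form.
base_weighted_product = 0.5824 * 1.0956    # weights from the base year
given_weighted_product = 0.5725 * 1.0769   # weights from the given year

# A formula meets the factor reversal test only if the gap is zero.
base_weighted_gap = base_weighted_product - value_ratio
given_weighted_gap = given_weighted_product - value_ratio

print(round(value_ratio, 4))
print(round(base_weighted_product, 4), round(base_weighted_gap, 4))
print(round(given_weighted_product, 4), round(given_weighted_gap, 4))
```

The first product overshoots the value ratio and the second falls short of it by nearly the same amount, which is what makes a geometric crossing of the two formulas attractive.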




The weighted geometric average also fails to meet this 
fundamental factor reversal test. With respect to both the 
geometric index and the aggregates we have, apparently, 
by the introduction of weights spoiled index numbers which 
in their simple form were unbiased. Yet weights we must 
have, if the index numbers are to represent the facts ac- 
curately. Neither a simple index nor a weighted form of a 
simple index will meet the two tests laid down as funda- 
mental. Professor Fisher tested 46 such formulas, of which 
only four (the simple geometric, median, mode, and ag- 
gregative) met the time reversal test, and none met the 
factor reversal test. 


THE “IDEAL” INDEX


A way out of this difficulty is offered by the possibility
of “rectifying” formulas in a crossing process, by averaging
geometrically formulas which err in opposite directions.
Professor Fisher has made exhaustive trials of all possible
formulas by this process, finding thirteen formulas in all
which met both tests. Of these he has selected one as
“ideal,” from the viewpoint of both accuracy and simplicity
of calculation. This ideal index is the geometric mean of the
two aggregative types illustrated above. Its formula¹ is

$$\sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}}$$

This index may be computed readily, in the present
instance, from the results already obtained. Thus for 1920
we have

Ideal index = $\sqrt{.5824 \times .5725}$ = .5774.

In the customary percentage form this is 57.74.

This index number meets both the time reversal and the
factor reversal test. Applying the former:

Index of prices, 1920 (1919 = 100) = 57.74
Index of prices, 1919 (1920 = 100) = 173.18
.5774 × 1.7318 = 1.00.

¹ The same formula was developed independently by Bowley, Pigou, Walsh,
and Young. See The Making of Index Numbers, pp. 240 ff.

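For the factor reversal test, applied to the data for 1920
(with 1919 as base), we have

Index of prices = $\sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}}$ = 57.74

Index of quantities = $\sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}}$ = 108.62

Value ratio = $\frac{\sum p_1 q_1}{\sum p_0 q_0}$ = .6272

Product of price and quantity indices = .5774 × 1.0862 = .6272.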
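
That the ideal formula meets both tests follows algebraically from its construction, and a small sketch makes this concrete. The prices and quantities below are hypothetical (three invented commodities, not the farm crop data), and the helper names `aggregate` and `ideal_index` are illustrative assumptions.

```python
from math import isclose, sqrt

# Hypothetical prices and quantities for three commodities in two periods.
p0 = [2.00, 0.50, 10.00]
q0 = [100.0, 400.0, 20.0]
p1 = [1.20, 0.45, 12.00]
q1 = [130.0, 380.0, 15.0]


def aggregate(x, w):
    """Sum of products, e.g. sum(p * q)."""
    return sum(a * b for a, b in zip(x, w))


def ideal_index(x0, x1, w0, w1):
    """Geometric mean of the base-weighted and given-weighted aggregatives."""
    base_weighted = aggregate(x1, w0) / aggregate(x0, w0)
    given_weighted = aggregate(x1, w1) / aggregate(x0, w1)
    return sqrt(base_weighted * given_weighted)


price_index = ideal_index(p0, p1, q0, q1)
quantity_index = ideal_index(q0, q1, p0, p1)      # p's and q's interchanged
reversed_price_index = ideal_index(p1, p0, q1, q0)  # bases interchanged
value_ratio = aggregate(p1, q1) / aggregate(p0, q0)

print(round(price_index * reversed_price_index, 6))   # time reversal
print(round(price_index * quantity_index, 6), round(value_ratio, 6))
```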


The ideal index, the two weighted aggregates that enter
into its construction and the geometric mean weighted by
values in the base year are given in Table 55 for the years
1919 to 1935. The index numbers are plotted in Fig. 53.

Fig. 53. Comparison of Four Weighted Index Numbers of Farm Crop
Prices, 1919-1935 (1919 = 100)

Table 55

Comparison of Weighted Index Numbers of Farm Crop Prices,
1919-1935

(1)     (2)                        (3)                        (4)                  (5)
Year    Aggregative                Aggregative                Ideal index          Weighted
        (weighted by base year     (weighted by given year    (geometric mean      geometric average
        quantities)                quantities)                of indices in        (weighted by base
        $\sum p_1 q_0 / \sum p_0 q_0$    $\sum p_1 q_1 / \sum p_0 q_1$    cols. (2) and (3))   year quantities)

1919    100.0                      100.0                      100.0                100.0
1920     58.2                       57.2                       57.7                 56.2
1921     42.8                       42.0                       42.4                 41.5
1922     53.6                       53.1                       53.4                 52.9
1923     59.8                       59.7                       59.8                 58.1
1924     65.0                       64.3                       64.6                 64.4
1925     57.9                       56.3                       57.1                 56.5
1926     51.4                       49.2                       50.3                 49.6
1927     54.5                       54.3                       54.4                 54.3
1928     51.8                       51.1                       51.4                 51.2
1929     54.1                       53.3                       53.7                 53.4
1930     41.3                       39.6                       40.4                 39.4
1931     26.0                       25.5                       26.0                 25.3
1932     19.0                       18.9                       19.0                 18.1
1933     32.6                       32.2                       32.4                 32.2
1934     51.6                       52.1                       51.8                 49.5
1935     38.1                       37.6                       37.9                 37.8

The wide discrepancies that were found between the various
simple index numbers do not appear when the weighted
indices are compared. There are significant differences, but
there is none of the erratic behavior of some of the simpler
forms.

Of these four types the ideal index probably serves
as the best measure of the average price change between
1919 and each of the given years.¹ It is designed, it should
be remembered, to measure the change between two stated
times, and not for intermediate comparison. The value of
the index for 1933, for instance, is determined by the relation
between prices and quantities in 1919 and in 1933.
There is double weighting and the weights vary from year
to year. If 1933 is to be compared with 1932 a new index
is needed, in which the prices and quantities for 1933 and
1932 alone are included. Direct comparison on the basis
of the values for the ideal index given in the above table
is liable to error, because of the weighting system employed.

¹ The year 1919, which is here employed as base, is not a satisfactory standard
of reference for economic purposes. It was a disturbed year, marking a
transition from war-time to peace-time conditions.

It is one of the merits of the geometric mean with constant 
weights that it permits the index for each year to be com- 
pared directly not only with the base year index, but with 
the index for any other year. The base may be shifted 
directly from the relatives, and the same result will be 
secured as if the computation were made from the original 
data. If this same system be followed with the ideal index 
no large errors may be expected, but strict accuracy will 
not be secured.* 
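
The base-shifting property of the geometric mean with constant weights can be sketched as follows. The prices for three periods and the weights are hypothetical, chosen only to illustrate the point; `geometric_index` is an illustrative helper, not a formula from the source.

```python
from math import exp, isclose, log

# Hypothetical constant weights and prices of three commodities in periods 0-2.
weights = [5.0, 3.0, 2.0]
prices = {
    0: [2.00, 0.50, 10.00],
    1: [1.20, 0.45, 12.00],
    2: [1.50, 0.60, 9.00],
}


def geometric_index(base, given):
    """Weighted geometric mean of the price relatives given/base (ratio form)."""
    weighted_logs = sum(w * log(pg / pb)
                        for w, pb, pg in zip(weights, prices[base], prices[given]))
    return exp(weighted_logs / sum(weights))


# Index for period 2 on base 1, computed from the original prices ...
direct = geometric_index(1, 2)

# ... and obtained by shifting the base of the two indexes computed on base 0.
shifted = geometric_index(0, 2) / geometric_index(0, 1)

print(round(direct, 6), round(shifted, 6))
```

The two results coincide because the weighted sum of logarithms of $p_2/p_0$ minus that of $p_1/p_0$ is the weighted sum of logarithms of $p_2/p_1$ when the weights do not change.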

SOME ALTERNATIVE TYPES 

The chief obstacles in the way of general adoption of the
ideal index arise from the difficulty of obtaining annual or
monthly quantities to use as weights, and from the time
involved in its computation. Where accuracy is essential
the latter is not a serious difficulty. As a substitute formula
which is much more quickly calculated Fisher has proposed

$$\frac{\sum (q_0 + q_1) p_1}{\sum (q_0 + q_1) p_0}$$

This formula, which has also been recommended by Edgeworth
and Marshall, is considered by Fisher to be “the
best practical all-around formula, taking all four points
into account: accuracy, speed, minimum legitimate circular
discrepancy, simplicity.” Results from this formula
will generally differ from those secured from the ideal formula
by less than one fourth of one per cent. Table 56
illustrates the method of computation, data for
1919 and 1920 being employed.

¹ If year to year comparison be a primary aim in a given instance, the ideal
index may be constructed on the chain system. Link index numbers are first
constructed, each year serving as base for the computation of the index for the
succeeding year. These links may then be “chained” with reference to a fixed
base. Warren M. Persons has shown that the errors involved in following this
method are cumulative, and may be serious if the links are chained for a number

Table 56

Computation of Aggregative Index Weighted by Combined Quantities

57.7 (index for 1920 on 1919 base, in percentage form)
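
A minimal sketch of the combined-quantity formula, set beside the ideal index for comparison. The prices and quantities are hypothetical (the farm crop quantities of Table 51 are not reproduced here), and the closeness of the two results in this example should not be read as a general guarantee.

```python
from math import sqrt

# Hypothetical prices and quantities for three commodities in two periods.
p0 = [2.00, 0.50, 10.00]
q0 = [100.0, 400.0, 20.0]
p1 = [1.20, 0.45, 12.00]
q1 = [130.0, 380.0, 15.0]


def aggregate(prices, quantities):
    """Sum of price times quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))


# Combined quantities (q0 + q1) serve as a single set of weights.
combined = [a + b for a, b in zip(q0, q1)]
combined_index = aggregate(p1, combined) / aggregate(p0, combined)

# The two weighted aggregatives and their geometric mean, for comparison.
base_weighted = aggregate(p1, q0) / aggregate(p0, q0)
given_weighted = aggregate(p1, q1) / aggregate(p0, q1)
ideal = sqrt(base_weighted * given_weighted)

print(round(100 * combined_index, 2), round(100 * ideal, 2))
```

Only one pair of aggregates and no square root is needed for the combined-quantity index, which is the saving in labor the text refers to.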

This formula requires the same data as the ideal index,
and these are not generally to be had. Usually it is only
possible to secure comprehensive quantity figures at each
census period, and for the intervening years constant weights
must be employed. In such cases the weighted aggregative

$$\frac{\sum p_1 q_0}{\sum p_0 q_0}$$

is probably the most generally useful type. The weighted
geometric has many virtues, but is subject to a definite
weighting bias. If no weights can be secured, or even approximated,
the simple geometric and the simple median
are far better than any of the other simple types. The
geometric mean is more generally useful than the median.

An index number of prices is always based upon the study
of a sample, the result being taken as representative of the
entire field of prices from which the particular sample was
drawn. Some method is needed, therefore, by which we may
judge of the reliability of the different types of index numbers,
of their probable stability when computed from a
number of successive samples. Some differences might be
expected between index numbers based upon different samples.
With which type of index number would these differences
due to fluctuations of sampling be least?¹

Truman L. Kelley² has attempted to measure the probable
errors of the chief types of index numbers and has
graded these types on the basis of excellence in this respect.
Two index numbers, the weighted geometric mean and the
weighted median, are given the highest grade, as being the
most reliable, the least affected by fluctuations of sampling.
Fisher’s ideal index is ranked somewhat lower, though above
the weighted arithmetic and harmonic averages of price
relatives. The simple unweighted arithmetic average of
relatives is given the lowest rating in the list.

¹ The subject of sampling, in relation to the reliability of statistical measures,
is discussed in greater detail below.

² Truman L. Kelley, Statistical Method, New York, Macmillan, 1921,
334-346.

For reliability, flexibility, and general excellence Kelley 
selects the weighted geometric mean as the best type of 
price index number. A ratio of aggregates

$$\frac{\sum p_1 w}{\sum p_0 w}$$

with selected weights (not necessarily precisely equal to the
quantities marketed or consumed) is given a total score, 
based on the essential requirements of a good index number, 
as high as that of the weighted geometric mean and higher 
than that of the ideal index. Weights other than actual 
quantities are used in order that there may be flexibility 
in the matter of weighting. 

The detailed discussion of procedures in the preceding 
pages has clearly shown that there are some definitely faulty 
formulas, obviously unsuited for use in the construction of
index numbers serving ordinary purposes. Among the better 
formulas there are some differences in respect of liability 
to bias and character of data needed, and some variations in 
sampling reliability. The maker of index numbers will have 
these in mind in choosing a formula to employ under given 
conditions. A more important factor in his choice, however, 
will be the purpose to be served by the index number, the
question it is designed to answer. A weighted aggregate of 
actual prices answers one question definitively. It gives, 
without equivocation, the aggregate cost of a fixed bill of 
goods at one period, in relation to the cost of the same bill 
of goods at another. A geometric mean of relative prices 
answers another question. It measures with accuracy the 
average ratio of the prices of given commodities at one period 
to corresponding prices at another period. Some questions 
(for example, that answered by an unweighted arithmetic 
average of relative prices) have little if any economic
significance. It is because one or two main questions have
bulked large in economic discussion that emphasis has been
placed upon the finding of a “best” type of index number. Yet
the terms “best” and “ideal” are unfortunate, for they imply
that some absolute standard exists, with reference to which
all formulas may be tested. No such absolute criterion may
be applied to the diversity of research problems that call
for the construction of index numbers. On the basis of his
knowledge of the characteristics of different formulas the
discriminating investigator will choose technical methods
adapted to his data and appropriate to his purposes.
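
That the aggregate of actual prices and the geometric mean of relatives answer different questions may be seen in a small numerical sketch. The two commodities, their prices, and the weights below are hypothetical, chosen only so that a cheap, lightly weighted article doubles in price while a heavily weighted one is unchanged.

```python
import math

# Hypothetical data: two commodities, base-period and given-period prices.
p0 = [1.00, 4.00]
p1 = [2.00, 4.00]
q = [10, 100]  # fixed bill of goods (assumed quantities)

# Question 1: what does the fixed bill of goods cost, relative to the base?
aggregate = 100 * sum(a * w for a, w in zip(p1, q)) / sum(b * w for b, w in zip(p0, q))

# Question 2: what is the average ratio of given-period to base-period prices?
relatives = [a / b for a, b in zip(p1, p0)]
geometric = 100 * math.exp(sum(math.log(r) for r in relatives) / len(relatives))

print(round(aggregate, 1), round(geometric, 1))
```

The first figure is close to 100 because the article that doubled in price counts for little in the bill of goods; the second is much higher because each relative counts equally. Neither is wrong; they measure different things.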


Other Problems Involved in the Construction
of Price Index Numbers

The preceding section has dealt with the technical problems
connected with the averaging of a given set of data
in order to secure an index number of price variations.
Certain methods have been shown to be quite faulty, while
certain others have been found to be appropriate for given
purposes. One who would use index numbers with intelligence
should understand fully the methods which have
been employed in securing given results, in order that he
may know precisely what the given figure is designed to
measure and what degree of reliability attaches to it.

Such problems as these are not the only ones which
confront those who construct index numbers, nor are these
the only ones which users of index numbers
should bear in mind. Of equal importance with problems
of averaging and weighting are the practical questions
connected with the selection of representative samples. The
only [...] accurate measure of the general level of
prices would be secured by determining the ratio
between all the money, including credit, in circulation
([...] circulation) and all the
goods exchanged for money over a given
period. The measurement of general price changes between
two periods would thus involve complete knowledge of these 
two factors for each of the two periods. Such knowledge, 
of course, cannot be had, so recourse must be had to the 
method of sampling. And primary importance attaches to 
the number of commodities and the character of the com- 
modities upon the prices of which a given index number is 
based. 

NUMBER OF COMMODITIES TO BE INCLUDED

Here again we are confronted with a relation that has 
already been mentioned, the relation between methods and 
uses. Decision as to the number of commodities and the 
kinds of commodities to be included in a given case must 
rest upon the purpose for which the index is to be con- 
structed. Assuming that the index number is to serve as a 
measure of general changes in the price level, the ques- 
tion as to the number of commodities to be included may 
be easily answered — the larger the sample the more rep- 
resentative will be the results. The frequency polygon 
based upon a large sample will approach more closely to the 
ideal curve which would represent all price quotations than 
will that based upon a small sample. Thus, as a measure 
of general price changes, more confidence may be placed 
in the Bureau of Labor Statistics index, which is based 
upon 813 price quotations, than in Bradstreet’s, which was 
based upon 96 quotations, though the latter had particular 
virtues of its own.¹ Yet index numbers based upon a small
number of quotations may not be ruled out as without 
value. Wesley C. Mitchell, whose researches have ma- 
terially increased our knowledge of the price system and of 
the characteristics of index numbers, has compared in detail 
index numbers based upon varying numbers of quotations. 
Unexpected similarities are found. Those constructed from 
a limited number of quotations reflect the broad movements 
of prices in much the same way as do those based upon the
prices of several hundred commodities. In important details
there are differences, however, differences which may involve
doubt as to the movement of prices in a given year.
In such cases the index numbers based upon many quotations
must be accepted as more accurate measures of general
price movements, provided that the commodities included
be equally representative of the various elements of the price
system.

¹ Bradstreet’s index was discontinued at the end of the year

For other purposes, however, index numbers based upon 
a limited number of quotations may be preferable. This 
is particularly true when a “sensitive” index is desired, one 
that will serve as a forecaster of general price movements 
rather than as a precise measure of changes in the general 
price level. Of this type is the Harvard sensitive price index 
based upon quotations on 13 basic commodities (raw ma- 
terials). The purposes of such an index are served by the 
selection of a limited number of commodities the prices of 
which are subject to extreme fluctuations, rather than by 
the inclusion of a great many commodities. Yet the uses 
to which an index of this type may be put are limited. 
The “sluggishness” of the many-commodities index number 
is a sluggishness which inheres in the price system, and 
which must be reflected in a faithful index of general 
prices. 

The question of the number of commodities to be included 
cannot be discussed apart from that of the character of 
these commodities. The representative character of an index 
number rests in part upon the number of price series in- 
cluded, but the nature of these series is of even greater 
importance. For there are highly significant differences in
the behavior of the prices of different commodity groups. 
These groups of prices, their interrelations, their behavior, 
their relation to the functioning of the economic system 
and to the swings of prosperity and depression, are matters 
of immediate and practical importance to economists and 
businessmen.

PRICE GROUPS IN THE FIELD OF WHOLESALE PRICES

Since an index number of wholesale prices must rest upon 
sample quotations, the sample must be representative, must 
include commodities whose prices are typical of the various 
elements in the price system. The division into elements 
for this purpose must be based upon the character of the 
price changes peculiar to the different groups. Of the 
groups thus distinguished, the most obvious are those rep- 
resenting different industries. Textile prices and steel prices, 
leather prices and the prices of chemicals are subject to 
different influences. Trade depressions and revivals do not 
affect all industries at the same time or in the same way, 
so that an index of wholesale prices must include quotations 
from all important industrial groups. If preponderant in- 
fluence upon an index is exerted by the prices of certain 
types of commodities, the index, by that much, loses its 
representative character. Thus Bradstreet’s index, it has 
been established, gave greater weight to cotton fabrics, 
hides and leather, and cured meats than was justified by their 
actual importance in trade, a fact which did not detract 
from its utility for some purposes but which lessened its 
value as a representative index of wholesale prices. 

The extent of these differences between the price move- 
ments of commodities in different industrial groups may be 
appreciated by comparison of the index numbers of whole- 
sale prices of grains and metals and metal products during 
the business recession that began in the summer of 1937. 

In order that an index may be representative it is not alone 
sufficient that all industries be given an appropriate number 
of representatives in the sample. Raw materials and man- 
ufactured goods show characteristic differences in their fluc- 
tuations, and fitting representation must be given to each 
of these groups. Prices of the former are, in general, more 
sensitive to changes in business conditions, their movements 
preceding those of manufactured goods and showing more
violent fluctuations. There are several reasons for this.
Raw materials are traded in for purposes of manufacture
and sale. When business improves after a period of depression,
increased demand on the part of consumers (or expected
increase in demand) leads competing manufacturers to bid
against each other for raw materials. It is in the raw material
markets that the pressure of increased demand first
centers, and this bidding generally causes prices to rise in
these markets before the prices of other goods are affected.
Similarly at the first evidence of slackening trade
manufacturers’ demand for raw materials falls off. Business
forces pure and simple play in the raw material markets
with more freedom than in the markets for manufactured
goods. Hence the tendency of prices in these markets to
anticipate, in their movements, prices in other commodity
markets.


Additional reasons for the greater stability of prices of
manufactured goods are found in the fact that these prices
include a greater percentage of stable cost factors, and in
the control over supply exercised by most manufacturers.
Wages, interest, rents move more slowly and less violently
than do commodity prices. The inclusion of these elements
in commodity prices tends to render these prices more stable.
Therefore as commodities move forward from the raw stage
to their final manufactured condition their prices include
more and more of these stabilizing elements, and become
less violent in their fluctuations.* Control over supply,
which manufacturers possess in much higher degree than
primary producers, makes possible the enforcement of definite
price policies by fabricators. Under these conditions,
stable prices and variable output are usually found.

Each of the groups last mentioned contains minor groups
of commodities with distinct price characteristics. Within
the raw material group there are marked differences between
agricultural products, animal products, forest products, and
mineral products. Agricultural products are affected by 
weather and crop conditions as well as by business conditions 
and, though subject to price fluctuations of some magnitude, 
reflect prevailing business conditions less accurately than do 
the prices of mineral products.¹ Animal and forest products
appear to stand between these two with respect to the 
faithfulness with which they reflect business conditions in 
their price movements. Thus, in selecting raw materials 
for inclusion in a sample of price quotations from which a 
representative index number is to be constructed, fair weight 
must be given to these various classes.²

Manufactured goods, again, do not constitute a single 
homogeneous group with respect to their price movements. 
In so far as they are to be used for further production, or to 
undergo further manufacture, they resemble raw materials 
in relation to the bidding of competing manufacturers, and 
their prices, therefore, are characterized by relatively wide 
oscillations. In so far as the demand for them is for the pur- 
pose of final consumption, purely business forces have less 
weight, and their prices are more stable. Related to this 
argument is that which has already been presented, the 
increasing stability of prices as the stable elements of wages 
and overhead charges bulk larger in commodity costs. So, 
again, the sample price quotations from which an index of 
wholesale prices is to be constructed must include prices 
representative of producers’ and consumers’ goods, of goods 
in the intermediate as well as the final stages of manufacture. 

Other important divisions of the price system exist. The
behavior of the prices of capital equipment differs from that
of prices of goods intended for human consumption. The
prices of durable goods differ in their fluctuations from the
prices of perishable goods. Goods imported into a given
country and goods exported from that country are usually
subject to the play of different forces. A representative
index number of wholesale prices should be based upon price
quotations drawn from all commodity groups marked by
distinctive modes of behavior, with weight given to each in
proportion to the relative importance in trade of the
commodities in that category.

¹ It should not be inferred from this that there is no relation between
agricultural production and the prices of agricultural products, and general business
conditions. The immediate price relation is frequently one of contradictory
movements, and cycles in agricultural production are not synchronous with
business cycles. But conditions in these two fields of economic activity are
mutually related in many ways.

² Cf. “The Making and Using of Index Numbers,” 47.

Price Comparisons over Time 

In the opening pages of this chapter the fact was noted 
that the degree of dispersion found in frequency distributions 
of price relatives depended upon the length of time covered 
in price comparisons. Hence, on statistical grounds, there is 
justification for the conclusion that the accuracy of well- 
constructed price indices is high for measurements extending 
over a short interval, and becomes progressively lower as 
the range of the time comparison increases. This conclusion 
is supported by other considerations. 

In Laspeyres’ formula,

$$I = \frac{\sum p_1 q_0}{\sum p_0 q_0}$$

the price factor alone varies, as between numerator and
denominator. The constant weighting factor, $q_0$, is assumed to
define quantities entering into trade in an unchanging system 
of income distribution, living standards, consumption habits, 
etc. This system, for which Sir George Knibbs has used 
the term “regimen,” is taken to be common to the two 
periods compared. If it is constant, and if the q’s which 
define its quantitative attributes are unchanged, then we may 
expect to measure with accuracy the one factor which does 
change — commodity prices. The condition we have here as- 
sumed is the orthodox one of ceteris paribus, the condition that 
factors other than the one subject to study remain constant. 

In fact, of course, the regimen does not remain fixed.
Changes in tastes and in consumption habits occur; changes
in types of goods used as capital equipment take place;
incomes shift, and the flow of goods is altered by changes in
the distribution of buying power among consuming groups;
the very price changes that we seek to measure bring altera- 
tions in the demand for given types of goods, and in the quan- 
tities produced. Of no small moment in the total situation 
are the changes that occur in the quality of goods that con- 
tinue to pass by the same trade names. The automobile of 
1938 is the same commodity, by name, as the automobile 
of 1910, but to the average consumer the later model repre- 
sents quite a different bundle of utilities. Similarly, steel, 
textiles, locomotives, even the staple articles of diet have 
undergone important quality changes. A comparison of 
price levels in 1910 and 1938 that depends for its accuracy 
on the assumption that all elements of economic life except 
prices have remained constant is suspect, indeed. 

Our difficulties are not removed if we take as the standard 
of reference the regimen of the second of the two periods 
compared. This is done in Paasche’s formula,

$$I = \frac{\sum p_1 q_1}{\sum p_0 q_1}$$

The system of consumption standards and all that goes 
with it may be of modern vintage in this case, but the 
difference between the regimens of the two periods
compared is just as wide. We have not held constant non-price
factors, and our measurement of price changes loses in 
accuracy, as a result. 

The method exemplified by the Ideal formula, that of 
employing weighting factors drawn from both periods, rep- 
resents one attempt at the solution of this problem, but it 
is far from perfect. The use of quantities drawn from the 
two regimens does not create a common regimen, the indis- 
pensable condition of full accuracy in such comparisons. 
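
The three formulas discussed in this section may be set side by side in a short sketch. The data are hypothetical, and the function names are assumptions made for illustration. The Ideal formula is taken here in its usual form, the geometric mean of the Laspeyres and Paasche results.

```python
import math


def laspeyres(p0, p1, q0):
    """Base-period quantities as weights: sum(p1*q0) / sum(p0*q0)."""
    return sum(a * w for a, w in zip(p1, q0)) / sum(b * w for b, w in zip(p0, q0))


def paasche(p0, p1, q1):
    """Given-period quantities as weights: sum(p1*q1) / sum(p0*q1)."""
    return sum(a * w for a, w in zip(p1, q1)) / sum(b * w for b, w in zip(p0, q1))


def ideal(p0, p1, q0, q1):
    """Geometric mean of the Laspeyres and Paasche ratios."""
    return math.sqrt(laspeyres(p0, p1, q0) * paasche(p0, p1, q1))


# Hypothetical regimens: buyers shift away from the commodity whose price rose.
p0, q0 = [1.0, 2.0], [10, 5]
p1, q1 = [2.0, 2.0], [5, 10]

L = laspeyres(p0, p1, q0)
P = paasche(p0, p1, q1)
I = ideal(p0, p1, q0, q1)
print(round(100 * L, 1), round(100 * P, 1), round(100 * I, 1))
```

With these figures the two single-regimen formulas give 150 and 120. The Ideal result lies between them, but, as the text observes, averaging the two answers does not create a common regimen.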

The practical procedure in the face of this difficulty is to 
restrict our comparisons, if high accuracy is required, to
periods not widely separated in time. Consumption habits,
living standards and technical production methods will be 
not widely dissimilar in two such periods, and hence the 
number of identical commodities common to the two periods 
will be large. Under these conditions considerable confi- 
dence may be placed in index numbers measuring average 
price changes. Comparison of price levels over longer periods
may be desired, and may be justified, but the margin of error 
in the measurements may be expected to increase as the time 
span extends. Formal precision in weighting and in the selec- 
tion of acceptable formulas will not provide an escape from 
the unavoidable difficulties arising out of alterations in the 
basic conditions of economic life. Real continuity of in- 
dices covering a stretch of years is possible only on the 
basis of a persisting common regimen. 

These considerations support the claims of an index of 
the chain type, which involves the measurement of price
changes between successive periods not far apart in time. 
Bruce D. Mudgett has advocated this procedure. The com- 
parison of price levels in two periods, close together in time 
and with similar regimens, will be accurate, if such an index 
as the Ideal be employed. The elements of such a chain
may then be linked together, in attempting to measure 
price changes between non-consecutive periods. If the regi- 
mens of the non-consecutive periods differ materially, the 
accuracy of the comparison will probably not be high. But 
it is reasonable to believe that better results will be secured 
by bridging the intervening years in the manner proposed
than by constructing a single far-flung index based only 
upon the widely dissimilar regimens of two periods far 
removed in time. 
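
The linking of a chain may be sketched as follows. The sketch assumes that each link is an Ideal index between adjacent periods and that the links are combined by multiplication; the three-period data and the names `ideal_link` and `chain_index` are hypothetical.

```python
import math


def ideal_link(p_prev, p_next, q_prev, q_next):
    """Ideal index (as a ratio) between two adjacent periods."""
    lasp = sum(a * w for a, w in zip(p_next, q_prev)) / sum(b * w for b, w in zip(p_prev, q_prev))
    paas = sum(a * w for a, w in zip(p_next, q_next)) / sum(b * w for b, w in zip(p_prev, q_next))
    return math.sqrt(lasp * paas)


def chain_index(prices, quantities):
    """Chain the links together; the first period is set equal to 100."""
    chain = [100.0]
    for t in range(1, len(prices)):
        link = ideal_link(prices[t - 1], prices[t], quantities[t - 1], quantities[t])
        chain.append(chain[-1] * link)
    return chain


# Hypothetical prices and quantities for three successive periods.
prices = [[1.0, 2.0], [2.0, 2.0], [2.0, 4.0]]
quantities = [[10, 5], [5, 10], [5, 10]]

chain = chain_index(prices, quantities)
direct = 100 * ideal_link(prices[0], prices[2], quantities[0], quantities[2])
print([round(v, 1) for v in chain], round(direct, 1))
```

The chained figure for the last period need not agree with the direct comparison of the first and last periods. With these invented data the two differ appreciably, which illustrates the warning that accuracy falls when the regimens of non-consecutive periods differ materially.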

The Wholesale Price Index of the United States
Bureau of Labor Statistics

The authoritative index of wholesale prices in the United 
States is that compiled by the United States Bureau of Labor
Statistics. This index was first constructed in 1902, for the
period beginning with 1890. It was continued until 1913 
as an unweighted average of relative prices, the base of 
each relative being the average price of the given commodity 
during the ten-year period 1890-1899. Various revisions of 
procedure have been made since 1913. As it stands at pres- 
ent the index for any given period (week, month, or year) 
is a weighted aggregate of actual prices, the aggregate being 
expressed, to facilitate comparison, as a relative with 1926 
as the base. 

The index now includes 813 price series. (A single com- 
modity may be represented by several quotations, the prices 
for different grades or in different markets being given. 
Thus for raw cotton there are three quotations, Middling, 
New Orleans; Middling Upland, New York; and Middling
Upland, Galveston.) In the derivation of the aggregate for 
any date each price quotation is multiplied by a given weight, 
known as a “quantity weight” or a “multiplier.” This 
same weight is applied to the price quotation for the base 
period. The cross products thus obtained for the base period 
and the date in question are values of a stated quantity of 
goods; they differ only in respect of the price factor. The 
following tabulation illustrates the method as applied to 
cotton: 



                                    Average price,    Quantity         Average price,
                                    November, 1937    weight           November, 1937
Commodity                           (per pound)       (pounds)         × quantity weight
                                    p_1               q_a              p_1 q_a

Cotton, Middling, New Orleans          $.080          1,399,496,000    $111,959,680
Cotton, Middling Upland, N. Y.          .080             77,750,000       6,220,000
Cotton, Middling Upland, Galveston      .077          6,297,729,000     484,925,133


When this process is carried out for the entire 813 price 
series included, the sum of the values in the last column 
gives the index number for the given period, in this case
November, 1937. As published, this sum is expressed as a
relative, the aggregate in 1926 representing 100.† The
formula for the index measuring the level of wholesale prices
at time “1,” with reference to the base level at time “0,” is
thus

$$\frac{P_1}{P_0} = \frac{\sum p_1 q_a}{\sum p_0 q_a}$$

where $q_a$ represents the constant multipliers. The method
of construction renders it possible to shift the base to any
desired year or month, changing the given relatives to
percentages on the new base.
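
Shifting the base of such a series is a matter of simple division, as the following sketch suggests. The index values are hypothetical, not published figures, and the function name `shift_base` is an assumption of the example.

```python
def shift_base(index_series, new_base):
    """Re-express every relative as a percentage of the value in the new base period."""
    base_value = index_series[new_base]
    return {period: 100.0 * value / base_value for period, value in index_series.items()}


# Hypothetical relatives on a 1926 base (illustrative values only).
series = {1926: 100.0, 1930: 80.0, 1937: 90.0}

rebased = shift_base(series, 1930)
print(rebased)
```

The ratios between any two periods are unchanged by the shift, which holds here because each relative is the ratio of aggregates computed with the same multipliers.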

This index number, therefore, is based upon the cost at 
wholesale of a biU of goods. The bill of goods remaining 
the same, the total cost changes as the prices of the various 
commodities change, and the index measures the effect of 
these changing individual prices upon the total cost. 
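
The cross products for the three cotton quotations in the tabulation above can be checked directly. The prices and multipliers in the sketch are those read from the tabulation; the variable names are assumptions of the example. Decimal arithmetic is used so that the products come out exact.

```python
from decimal import Decimal

# (series, average price per pound in November 1937, quantity weight in pounds)
cotton = [
    ("Cotton, Middling, New Orleans", Decimal("0.080"), 1_399_496_000),
    ("Cotton, Middling Upland, N. Y.", Decimal("0.080"), 77_750_000),
    ("Cotton, Middling Upland, Galveston", Decimal("0.077"), 6_297_729_000),
]

# Price times multiplier for each series: the value of a stated quantity of goods.
products = {name: price * weight for name, price, weight in cotton}

# Contribution of the three cotton series to the aggregate for November 1937.
cotton_aggregate = sum(products.values())

for name, value in products.items():
    print(f"{name}: ${value:,.0f}")
print(f"Cotton total: ${cotton_aggregate:,.0f}")
```

The published index requires the corresponding sum over all 813 series for both the given period and the base period. The base-period cross products are not given in the text, so the sketch stops at the aggregate for the three cotton series.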

It is essential, of course, that the quantity used as
multiplier for each series of price quotations truly represent
the relative importance of the commodity in question. The
multipliers employed are approximations to the quantities
actually marketed. Changes are made from time to time in
these quantities, the revisions being applied, of course, to
the base period aggregates as well as to the figures for later
periods. In addition, when it is necessary to substitute one
price series for a related one that has been discontinued or
has lost significance, minor modifications are made in the
multipliers so as to maintain comparability between the
aggregates for periods preceding and periods following the
date of substitution.*

The Bureau of Labor Statistics publishes index numbers
of wholesale prices for 10 major and 45 minor commodity
groups, as well as a general index for all commodities. The
major groups include farm products, foods, hides and leather
products, textile products, fuel and lighting, metals and metal
products, building materials, chemicals and drugs, house
furnishing goods, and miscellaneous commodities. The
constituent elements of the index are also classified into raw
materials, semi-manufactured articles and finished products,
and measurements of price changes for these groups are
computed. The National Bureau of Economic Research has
constructed index numbers for various other categories of
commodities, utilizing the quotations of the Bureau of Labor
Statistics. These classes include raw and processed goods,
durable and non-durable goods, producers’ goods and
consumers’ goods, goods destined for use in capital equipment
and goods destined for human consumption, foods and
non-foods, and crops, animal products, minerals, and forest
products.¹ The availability of index numbers for various
significant classes of goods makes it possible to trace price
changes with more precision, and to interpret them more
accurately, than when dependence is placed upon a single
all-embracing index. For the elements of the price system
are marked by wide diversity in their behavior over both
long and short periods of time.

† The base period is, at this writing, twelve years removed in time.
Adoption of a 1935-1937 base period is now being considered.

* See “Revised Method of Calculation of the Wholesale Price Index,”
by Jesse M. Cutts and Samuel J. Dennis, Journal of the American
Statistical Association, December, 1937, 663-674.

Other Price Index Numbers

The measurement of price changes by the use of index 
numbers has not been confined to wholesale prices. Many 
variations of this device have been utilized in measuring 
price movements in other fields. It will be useful at this 
point briefly to indicate the character of some of these 
variations.²

¹ See Prices in Recession and Recovery, N. Y., National Bureau of Economic
Research, 1936, 492-540.

² Detailed information concerning the character and content of a wide variety
of index numbers, price and other, will be found in An Index to Business Indices,
Donald H. Davenport and Frances V. Scott, Chicago, Business Publications,

INDEX NUMBERS OF RETAIL PRICES

An index of retail food prices is published currently by 
the United States Bureau of Labor Statistics. The general 
methods employed are similar to those already explained in 
connection with the index of wholesale prices computed by 
that agency, with such differences as inevitably result from 
the nature of the material. 

Actual retail selling prices of 84 articles of food are secured 
biweekly from dealers in 51 representative cities throughout 
the United States. In weighting the quotations on foods 
of a single type (fresh vegetables, for example) in a given 
city, account is taken of the quantities of such foods con- 
sumed by an average wage-earner’s family in that city or, 
for some regions, in the district in which that city lies. In 
obtaining weights consumption by food groups is considered, 
rather than by specific commodities, since the commodities 
actually priced must be taken to represent similar commodi- 
ties for which no prices are collected. 

The combination for a single city (or geographical area) 
of food prices thus weighted yields an index for that region. 
The food cost index for the United States is computed from 
the aggregates for the 51 cities, each weighted according to 
the population of the area which the city is taken to repre- 
sent. Thus the weights entering the final index of retail 
food prices for the country as a whole represent quantities 
consumed by the average wage-earner’s family, and the 
population assumed to be affected by each series of quoted 
prices. The base of the index numbers, as published, is the 
average of the three-year period 1923-1925. 
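
The combination of city aggregates into a national figure may be pictured as below. This is a sketch under stated assumptions, not the Bureau's actual procedure: the two cities, their populations, and their food-cost aggregates are hypothetical, and it is assumed that each city's aggregate is simply multiplied by the population it represents before the sums are compared.

```python
# Hypothetical city data: population represented, and the weighted cost of the
# food budget in the base period and in the current period.
cities = {
    "City A": {"population": 2_000_000, "base_cost": 10.00, "current_cost": 11.00},
    "City B": {"population": 500_000, "base_cost": 8.00, "current_cost": 8.00},
}


def national_index(city_data):
    """Population-weighted ratio of current to base food-cost aggregates, times 100."""
    current = sum(c["population"] * c["current_cost"] for c in city_data.values())
    base = sum(c["population"] * c["base_cost"] for c in city_data.values())
    return 100.0 * current / base


print(round(national_index(cities), 1))
```

The larger city dominates the result, as the weighting by population intends. A 10 per cent rise in City A and no change in City B give a national figure a little above 108.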

The indices of retail food prices, together with index 
numbers of the prices of electricity and coal, at retail, are 
published in the Monthly Labor Review. 

The difficulties inherent in the problem of measuring 
wholesale price movements have been discussed at some 
length. The construction of index numbers of retail prices 
of the type just described presents even greater difficulties.
All the theoretical problems arising in the former case are
to be solved and, in addition, the practical difficulties of
securing suitable weights, accurate price figures, and
comparable quotations are intensified. Because of the lack of
commodity standardization, and because of variations in business
practice and local customs, the latter difficulty is particularly
acute. For these reasons no index of retail prices at present
published can be accepted with the confidence with which
the best indices of wholesale prices may be received.

INDEX NUMBERS OF THE COST OF LIVING

If these problems are acute in constructing an index of
retail prices they are doubly hard to solve in measuring
such an entity as the cost of living. When food prices,
rents, retail clothing prices, cost of fuel and light, retail
furniture prices, and the prices of the other miscellaneous
items which are included in the budget of the average family 
are to be averaged, and an index number constructed to 
measure variations in the cost of these items, numerous 
statistical difficulties must be overcome. Theoretical ques- 
tions concerning the most suitable methods of averaging 
and weighting present themselves, but more important are 
the practical problems involved in the collection of accurate 
and comprehensive prices and weighting data. 

Two index numbers of the cost of living are currently 
compiled in the United States, one by the Bureau of Labor
Statistics, one by the National Industrial Conference Board 
of New York. The former appears in the Monthly Labor 
Review, the latter in periodic publications of the Conference 
Board. In each case the chief items of domestic expenditure 
are weighted in accordance with their relative importance in 
household budgets, and the combined results expressed as 
relative numbers. These are given on the 1913 and 1923-1925 
base by the Bureau of Labor Statistics, on the 1923 base 
by the Conference Board.¹

^ For a' general :disciission^ of the. problem, with, details of; the Conference 
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INDEX NUMBERS OF PRICE AND BUYING POWER OP 
FARM PRODUCTS 

A set of useful index numbers relating to the prices re- 
ceived by and the prices paid by farmers is compiled by the 
United States Department of Agriculture. The first of these 
is based upon the prices at the farm, as of the middle of 
each month, of 34 major farm products and 13 commercial 
truck crops. The weights employed are the average quan- 
tities marketed by farmers during the period 1924-1929. 
Farmers and agricultural economists have need of such a 
specialized index, because the wholesale prices of farm prod- 
ucts in the great exchanges or in large cities are often poor 
representatives of the prices actually received by farmers. 

The index of prices paid by farmers is compiled quarterly 
(in March, June, September, and December) . The constitu- 
ent quotations are retail prices paid by farmers for commod- 
ities used in family maintenance and in production. Weights 
are estimated quantities bought by farmers. The base of 
the index of farm prices, as published, is the average of the 
five pre-war years from August, 1909 to July, 1914; that of 
the index of prices paid, 1910-1914. Measurements for sub- 
groups are given, in both cases. 

These two index numbers are used in the derivation of an 
index of the purchasing power of farm products. The com- 
putation of the purchasing power index may be illustrated 
with reference to the figures for 1936. In that year the index 
of prices of farm products was 114. The index of prices paid 
by farmers was 124. That is, the farmer was receiving 
14 per cent more, on the average, for a unit of product than 
in 1909-1914, but the average price paid by him for a unit of 
goods purchased was 24 per cent higher than in the base 
period. Therefore the purchasing power of an average unit 
of farm products was 8 per cent less than in 1909-1914 
(114 ÷ 124 = .92). 
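
The arithmetic of the purchasing power index can be sketched in a few lines. The function name `purchasing_power` is an illustrative assumption; the figures are those given for 1936 in the text and for 1932 in Table 57.

```python
def purchasing_power(received: float, paid: float) -> int:
    """Prices received as a percentage of prices paid, rounded to a whole index point."""
    return round(100 * received / paid)

# (year, index of prices received, index of prices paid)
rows = [(1932, 65, 107), (1936, 114, 124)]

for year, received, paid in rows:
    print(year, purchasing_power(received, paid))
```

Both index numbers must stand on the same base period for the ratio to be read as a change in buying power since that period.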

Board procedure, see Cost of Living in the United States, 1914-1936, M. Ada 
Beney, New York, National Industrial Conference Board, 1936. 
These three index numbers, for selected years, are given 
in Table 57. 

                             Table 57 

  Index Numbers of Farm Prices, Prices Paid by Farmers, and the 
                 Buying Power of Farm Products ^ 

     (1)             (2)               (3)               (4) 
                                                   Average per unit 
                Prices received    Prices paid     purchasing power 
     Year         by farmers       by farmers      of farm products 
                                                      (2) ÷ (3) 

  1910-1914          100*              100               100 
  1918               202               176               115 
  1920               211               201               105 
  1921               125               152                82 
  1925               156               157                99 
  1929               146               153                95 
  1932                65               107                61 
  1933                70               109                64 
  1934                90               123                73 
  1935               108               125                86 
  1936               114               124                92 
  1937               121               131                93 

* Aug., 1909-July, 1914 = 100. 

These are significant measurements, yielding valuable in- 
formation concerning the buying and selling relations that 
vitally affect one important group of producers. The devel- 
opment of similar measurements for other groups will add 
materially to our understanding of the changes that shifting 
market relations entail, in the economy at large. Yet the 
limitations of these index numbers should not be overlooked. 
The measurement of prices paid by farmers and, correspond- 
ingly, the measurement of the purchasing power of farm 
products, are subject to the difficulties referred to in the 
discussion of retail prices and living costs. Under existing 
conditions the margin of error in all such measurements is 
fairly wide. The error is the greater, too, the longer the 
time span covered by the quotations. In the present case, 
goods bought by farmers have undergone greater changes 

^ Source: The Agricultural Situation, U. S. Bureau of Agricultural Economics. 
in quality than have the fairly standardized staples that 
the farmer sells. Here, as in all price comparisons over time, 
greater confidence must attach to short-period comparisons 
than to those spanning several decades. 
CHAPTER VII 


THE ANALYSIS OF TIME SERIES: MEASUREMENT 

OF TREND 

The preceding sections have dealt primarily with frequency 
series and with problems arising in the attempt to organize 
and describe such series. We are now concerned with 
data in the study of which the essential problem is the 
analysis of chronological variations. Such series are of 
major importance in the field of economic statistics, for 
most of the data of economics and business are variables in 
time, as bank clearings, steel production, volume of sales, 
etc. This dominating importance of series in time is not 
found in any other field of statistical research, and the 
development of methods of analysis appropriate to time 
series has come, accordingly, only within recent years with 
the wider adoption of statistical methods in the field of 
economics. 

Problems connected with time series arise both in the 
ordinary routine of internal administration and in the 
analysis of general economic conditions. Sales, purchases, 
profits on the one hand, stock prices, interest rates, business 
failures on the other, are variables which fluctuate with the 
passage of time. In the analysis of such series it is generally 
desired that the rate and character of growth be determined, 
and that periodic and accidental fluctuations be isolated 
for study. The sales manager wishes to know how the vol- 
ume of sales is faring, when and why it fluctuates and how 
it compares with volume of production. The economist 
desires to trace the trend of prices, and to scrutinize minutely 
the upward and downward movements of the price level. 
The making of business plans on even a small scale, as 
well as the most elaborate schemes of economic forecast- 
ing, must rest upon such study of past trends and fluctua- 
tions, and upon comparison of the movements of related 
series in time. Scientific study of the business cycle is only 
possible through the application of such methods. Our 
present task is the development of methods appropriate to 
the analysis of series in time. 

The Preliminary Organization of Time Series 

The data of time series usually require less preliminary 
organization than do statistical data which are to be reduced 
to the form of a frequency distribution. The source ... 
the figures are taken usually ... be clearly understood. 
... monthly data may be ... of the New York ... the stock 
price index ... (as for the ... averages for each month ... 
Bureau of Labor Statistics' price index) or ... cotton 
consumption). They may be cumulative monthly figures ... 
data. If average ... it is important to know how the 
average has been secured. 

... strict ... there be ... between data for different 
periods. Any attempt to ... homogeneous must be misleading 
and futile. Yet such series are not ... Commodity 
production or ... published by trade associations and by 
governmental ... sometimes based upon returns from a ... 
A series ... lack comparability as between different ... 
because of changes in the unit or grade to 
which the quotations apply, or because quotations are 
drawn from different markets. Changes in census classifica- 
tions may result in lack of comparability of census data. 
A change in a salesman’s territory may alter his returns 
materially. It is stated that the character of the obligations 
represented by the United States Steel Corporation’s figures 
for “unfilled orders” has varied from time to time. Records 
relating to the physical output of a given commodity in 
different periods may be rendered inaccurate by changes 
in quality or design. These are examples of faults that 
may be found in time series, rendering analysis futile. 
Strict testing is essential before a series be accepted as 
accurate and homogeneous. 

GRAPHIC REPRESENTATION OF TIME SERIES 

Normally the first step to be taken in visualizing a series 
in time and in preparing for further analysis consists of 
plotting the data. The trend and general characteristics 
of a series may be most readily apprehended through 
graphic representation. The data may be plotted on ordinary 
arithmetic or on semi-logarithmic paper. The advantages 
of the latter types for certain purposes have already been 
explained. The choice in a given case will depend upon the 
nature of the data and the object of the study. If interest 
lies in the absolute amount of fluctuations in sales, prices, 
pig iron production or whatever may be in process of 
analysis, or in the comparison of absolute differences between 
series, the ordinary rectilinear chart is to be employed. 
If percentage variations and the comparison of relative 
fluctuations are matters of interest, the semi-logarithmic 
representation is preferable. In general, if one is accustomed 
to the interpretation of this latter type of chart, its use is 
advisable. A clearer, less-distorted presentation of relations 
and a more significant comparison of series are generally 
secured when economic data having time as one variable 
are plotted on paper with a logarithmic ruling on one axis. 
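
Why the logarithmic ruling suits the comparison of relative fluctuations may be seen from a small sketch. The figures are hypothetical, chosen to grow by 50 per cent in each period; equal percentage changes give equal steps in the logarithms, and hence equal vertical distances on a semi-logarithmic chart.

```python
import math

# Hypothetical series growing 50 per cent in each period.
series = [100.0, 150.0, 225.0, 337.5]

# Steps as they appear on an ordinary arithmetic scale.
arithmetic_steps = [b - a for a, b in zip(series, series[1:])]

# Steps as they appear on a logarithmic scale.
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(series, series[1:])]

print(arithmetic_steps)
print([round(step, 4) for step in log_steps])
```

On the arithmetic scale the successive rises grow larger; on the logarithmic scale they are of the same height, so that a constant rate of growth plots as a straight line.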
For some purposes the process of studying series in time 
will have been completed when the data are thus plotted. 
The general trend may be roughly determined from the 
chart. The existence of seasonal and other periodic varia- 
tions may be ascertained. Rough comparisons of trends 
and fluctuations may be made. All the knowledge thus 
secured, it should be noted, will be non-quantitative in 
character, and the comparisons will be tentative and 
approximate. Even so, such charts enable trends and 
relations to be much more clearly visualized than do the 
raw figures, and for some purposes the knowledge thus 
secured is sufficient, though it lacks precision and accuracy. 
For other purposes more exact measurement and more 
refined analysis are required. Certain appropriate methods 
may be described. 

Forces Affecting Series in Time 

The general object in the analysis of a time series is the 
isolation of the effects of one or more of the forces affecting 
... in order that the ... series may be understood, in 
order that the future behavior of the series may be pre- 
dicted, or in order that two or more series may be compared. 

... possible to isolate these effects of ... with absolute 
accuracy, and in some cases ... approximate such a result. 
But ... sufficiently long period, the effects ... upon the 
behavior of a given series may usually be measured with 
some degree of accuracy. 

... these forces that affect series of data in time? 
... given case may be unique, affecting only ... 
influences acting 
SECULAR TREND 

In the first place, most series of economic statistics exhibit 
definite trends. Such a trend may be constant in direction, 
may change direction at a constant rate, or may even be 
characterized by abrupt shifts in direction or rate that 
reflect the introduction of novel elements. Thus the volume 
of production or sales of a business house over a period 
of years usually shows a fairly regular growth. The same 
is true of population, the production of basic minerals, 
the number of motor vehicles registered, etc. In some 
cases the rate of growth may be a negative one, as is true 
of interest rates in the United States over the last half 
century. The concept of secular trend (i.e., trend over a 
long period of time) covers both positive and negative 
changes of this type. 

In the analysis of a time series the trend value at any 
date is taken to be the “normal” value at that date. This 
conception of a normal value which may be used as a base 
or point of reference in judging the effects of all forces other 
than the growth factor, is fundamental in economic analy- 
sis. “No other method,” says Carl Snyder, “enables us so 
quickly to set economic events in their just perspective.” We 
should note, however, that such a normal value is essen- 
tially an empirical construction. While useful for purposes 
of reference, and as one of a series of measurements reflect- 
ing secular movements in a given series, it should not be 
assumed to possess any special normative significance. 

The fact should be emphasized that by secular trend is 
meant the smooth, regular, long-term movement of a statis- 
tical series. Frequent and sudden changes either in absolute 
amounts or in rates of increase or decrease are quite incon- 
sistent with the idea of secular trend. It is true that there 
may be occasional changes due to the interjection of a 
new element or the withdrawal of an old factor. But the 
breaking up into numerous sections of the period covered 
by a time series, and the determination of trend for each 
of these minor periods, does violence to the very concept 
of secular change. 

It does not follow from this discussion that a definite 
rising or declining trend exists for all time series. Many 
series, such as barometric readings at a certain point, 
merely fluctuate about a constant level that does not change 
with the passage of time. 

PERIODIC FLUCTUATIONS 

If the plotted representation of a time series be studied 
... may be discerned in the general upward ... downward 
drift, but may not be precisely determined ... existence of 
numerous fluctuations ... fluctuations may be regular or 
irregular, violent or mild, simple or complex. The value of 
the variable at any given date may be thought of as the net 
resultant of the interaction ... trend and the various forces 
that tend to modify the persistent secular movements of a 
given series. (... the trend is the resultant of the interplay 
of a variety of conflicting forces, rather ... movement upon 
which the periodic ... fluctuations are superimposed. In the 
present discussion ... define the organic relations that ... 
the forces affecting a series in time.) These latter forces 
may be of several types. 

... series of economic statistics for ... values are 
obtainable ... consumption and production of commodities, 
... clearings, railroad freight traffic, and many ... marked 
by seasonal swings repeated with minor variations year after 
year. ... definitely periodic in character ... constant 
twelve-month period. Less markedly periodic, but 
nevertheless characterized by a 

considerable degree of regularity, are the cyclical fluctuations 
that are found in series affected by forces connected with 
economic or business cycles. Prices, wages, the volume 
of industrial production, trading on the Stock Exchange, 
and most series relating to the activities of individual 
business units are affected by the swings of business through 
alternating periods of depression and prosperity. While 
the length of such periods may vary, the general sequence 
of change has been in the past sufficiently regular to render 
these cyclical movements capable of study. 

RANDOM FLUCTUATIONS 

Entangled with these more or less regular movements 
are the effects of random, accidental, and irregular fluctua- 
tions — catastrophic events such as the San Francisco earth- 
quake, wars, floods, fires, and countless minor events equally 
fortuitous though less violent in the resulting disruptions. 
These events influence the value of a variable at a given 
date, modifying the effects of long-term movements and of 
seasonal and cyclical factors. The observed value at any 
time is the resultant of the play of all these forces. 

The analysis of series in time involves the isolation of 
the effects of these various forces, so far as this is possible. 
A problem may call for the study of but one factor, or it 
may require the complete breaking up of given values. 
When annual data are used the seasonal element will not 
enter, of course. The explanation of methods begins with 
a consideration of problems involving only this type of data. 

The Measurement of Secular Trend 

As an example of the type of material in connection with 
which these problems arise, the figures in Table 58 
may be taken. The values are given in thousands of 
millions in order to simplify the calculations. 

                          Table 58 

      New York Clearing House Transactions, 1875-1936 

                 (In thousands of millions) 

    1875    25.1     1891    34.1     1907    95.3     1923   214.6 
    1876    21.6     1892    36.3     1908    73.6     1924   235.5 
    1877    23.3     1893    34.4     1909    99.3     1925   276.9 
    1878    22.5     1894    24.2     1910   102.6     1926   293.4 
    1879    25.2     1895    28.3     1911    92.4     1927   307.2 
    1880    37.2     1896    29.4     1912    96.7     1928   368.9 
    1881    48.6     1897    31.3     1913    98.1     1929   456.9 
    1882    46.6     1898    39.9     1914    89.8     1930   399.5 
    1883    40.3     1899    57.4     1915    90.8     1931   287.7 
    1884    34.1     1900    52.0     1916   147.2     1932   177.3 
    1885    25.3     1901    77.0     1917   181.5     1933   154.6 
    1886    33.4     1902    74.8     1918   174.5     1934   162.7 
    1887    34.9     1903    70.8     1919   214.7     1935   174.4 
    1888    30.9     1904    59.7     1920   252.3     1936   186.5 
    1889    34.8     1905    91.9     1921   204.1 
    1890    37.7     1906   103.8     1922   213.3 

As has been pointed out, the figure for any year, such as 
1934, is the net resultant of the many forces that we have 
classified under the headings of secular trend, cyclical 
variations, and random or accidental fluctuations. Our 
first problem is to measure the secular trend. 

... the ... clearings during ..., inclusive, have been 
plotted. A definite trend ... well marked and more or less 
... from that trend. Several methods ... at approximations 
to this trend ... be made to ... fluctuations and to arrive 
... the steadily operating growth factor, ... between the 
time ... on to the ... the data by hand gives somewhat the 
same result, the curve being frankly approximate and 
empirical in character. In certain studies it has been found 
possible to use one statistical series as base or trend line 
for another series of homogeneous data. 

Fig. 54. New York Clearing House Transactions, 1875-1936, with Moving Averages 

Moving Averages 

When a trend is to be determined by the method of 
moving averages, the average value for a number of years 
(or months, or weeks) is secured, and this average is taken 
as the normal or trend value for the unit of time falling 
at the middle of the period covered in the calculation of 
the average. Table 59 shows the results secured when three-, 
five-, seven-, and nine-year moving averages are thus 
computed for New York clearings for the period 1912-1936. 

                              Table 59 

   New York Clearing House Transactions, 1912-1936, and 3-, 5-, 7-, 
                   and 9-Year Moving Averages 

                    (In thousands of millions) 

          Original   Three-year   Five-year    Seven-year   Nine-year 
   Year     data     moving av.   moving av.   moving av.   moving av. 

   1912   $ 96.7 
   1913     98.1      $ 94.87 
   1914     89.8        92.90      $104.52 
   1915     90.8       109.27       121.48      $125.51 
   1916    147.2       139.83       136.76       142.37      $149.51 
   1917    181.5       167.73       161.74       164.40       161.44 
   1918    174.5       190.23       194.04       180.73       174.24 
   1919    214.7       213.83       205.42       198.23       188.11 
   1920    252.3       223.70       211.78       207.86       204.19 
   1921    204.1       223.23       219.80       215.57       218.60 
   1922    213.3       210.67       223.96       230.20       231.03 
   1923    214.6       221.13       228.88       241.44       245.78 
   1924    235.5       242.33       246.74       249.29       262.91 
   1925    276.9       268.60       265.52       272.83       285.64 
   1926    293.4       292.50       296.38       307.63       307.36 
   1927    307.2       323.17       340.66       334.04       315.62 
   1928    368.9       377.67       365.18       341.50       311.48 
   1929    456.9       408.43       364.04       327.27       302.49 
   1930    399.5       381.37       338.06       307.44       289.80 
   1931    287.7       288.17       295.20       286.80       276.58 
   1932    177.3       206.53       236.36       259.01       263.17 
   1933    154.6       164.87       191.34       220.39 
   1934    162.7       163.90       171.10 
   1935    174.4       174.53 
   1936    186.5 

The three-year moving average for 1916 is the average 
of the figures for 1915-16-17, the five-year figure for 1916 
is the average of the years 1914-15-16-17-18. The other 
averages are computed in the same way. In each case the 
average is centered for the period included; that is, the 
average is taken to represent normal as of the middle 
of the given period. The employment of an odd number 
of years simplifies this centering process, though it is not 
essential that the number be odd. With an even number 
of years, the figure may be centered by taking a two-year 
moving average of the first moving average. The three- 
and nine-year moving averages for the entire period are 
plotted, with the original data, in Fig. 54. 
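
The centering procedure just described may be sketched as follows. The function names are illustrative assumptions, not part of the text; the data are the clearings for 1912-1920 from Table 59. For an even period the sketch takes a two-item moving average of the first moving average, as the text directs.

```python
def moving_average(values, period):
    """Averages of each run of `period` consecutive items (not yet centered)."""
    return [sum(values[i:i + period]) / period
            for i in range(len(values) - period + 1)]

def centered_moving_average(values, period):
    """Moving average placed opposite the middle item of each run.

    Returns a list as long as `values`, with None where no average exists.
    With an even period the first averages fall between two items, so a
    two-item moving average of them is taken to center the result.
    """
    averages = moving_average(values, period)
    if period % 2 == 0:
        averages = moving_average(averages, 2)
    half = period // 2
    padded = [None] * half + averages
    return padded + [None] * (len(values) - len(padded))

# New York clearings, 1912-1920, in thousands of millions (Table 59).
years = list(range(1912, 1921))
clearings = [96.7, 98.1, 89.8, 90.8, 147.2, 181.5, 174.5, 214.7, 252.3]

for period in (3, 5, 9):
    trend = centered_moving_average(clearings, period)
    print(period, round(trend[years.index(1916)], 2))
```

The three figures printed for 1916 may be compared with the corresponding entries of Table 59.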

It is obvious that the effect of the averaging is to give a 
smoother curve, lessening the influence of the fluctuations 
that pull the annual figures away from the general trend. 
The longer the period included in securing each average, 
the smoother is the curve secured, though there are other 
factors to consider in deciding upon the length of the period. 
Certain of these factors may be noted. 

CHARACTERISTICS OF MOVING AVERAGES 

Given cyclical fluctuations about a uniform level or about 
a line ascending with a uniform slope, the length of the 
cycle and the magnitude of the fluctuations being constant, 
a moving average having a period equal to the period of 
the cycle (or to a multiple of that period) will give a straight 
line, a perfect representation of the trend. Under the 
same conditions a moving average having a period greater 
or less than the period of the cycle will give, not a straight 
line, but a new cycle having the same period as the original, 
but with fluctuations of less magnitude. The minima and 
maxima of the cycles thus obtained will not necessarily 
coincide with the minima and maxima of the original cycles. 
In general, when such a new cycle is obtained the magnitude 
of its fluctuations is the smaller, the longer the period on 
which the average is based. 

These points are illustrated by the figures in 
Table 60, arbitrarily chosen. In the first example five 
values have been selected which repeat themselves in 
sequence, fluctuating about a common level. 

The moving averages in columns (2) and (3) represent 
the data with the cycles completely removed. When the 
period of the average is not equal to the period of the cycle, 
or to a multiple of that period, the cycle is not removed, 
as is apparent from the figures in columns (4) and (5). 

                              Table 60 

         Illustrating the Application of Moving Averages 

     (1)          (2)              (3)              (4)              (5) 
   Cyclical  Moving average   Moving average   Moving average   Moving average 
     data      of 5 items      of 10 items       of 3 items       of 8 items 
                               (centered)                        (centered) 

      2 
      6                                           5 1/3 
      8         6 1/5                             8 
     10         6 1/5                             7 2/3 
      5         6 1/5                             5 2/3            6 3/8 
      2         6 1/5            6 1/5            4 1/3            6 13/16 
      6         6 1/5            6 1/5            5 1/3            6 3/8 
      8         6 1/5            6 1/5            8                5 3/4 
     10         6 1/5            6 1/5            7 2/3            5 11/16 
      5         6 1/5            6 1/5            5 2/3            6 3/8 
      2         6 1/5            6 1/5            4 1/3            6 13/16 
      6         6 1/5            6 1/5            5 1/3            6 3/8 
      8         6 1/5            6 1/5            8                5 3/4 
     10         6 1/5            6 1/5            7 2/3            5 11/16 
      5         6 1/5            6 1/5            5 2/3            6 3/8 
      2         6 1/5                             4 1/3            6 13/16 
      6         6 1/5                             5 1/3 
      8         6 1/5                             8 
     10                                           7 2/3 
      5 

(The items in columns (3) and (5) have been centered by means of a 
moving average of 2 items.) 

The conclusions suggested above hold when the cyclical 
fluctuations take place about any straight line. In Table 61 
the foregoing data have been employed but with a constant 
increment of 3. This is equivalent to superimposing the 
same cycles upon a line with a slope of + 3. 
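
The behavior described here can be checked with a short sketch, assuming the five-item cycle 2, 6, 8, 10, 5 repeated four times, and the same cycle with a constant increment of 3. The helper name `moving_average` is illustrative; exact fractions are used so that the results may be compared with the fractional entries of the tables.

```python
from fractions import Fraction

def moving_average(values, period):
    """Exact averages of each run of `period` consecutive items."""
    return [Fraction(sum(values[i:i + period]), period)
            for i in range(len(values) - period + 1)]

cycle = [2, 6, 8, 10, 5] * 4                          # cycle about a uniform level
trended = [c + 3 * k for k, c in enumerate(cycle)]    # same cycle on a line of slope +3

# Period equal to the cycle: a single constant value, the cycle is removed.
print(set(moving_average(cycle, 5)))

# Period of 3: a cycle of the same length persists, with a smaller range.
ma3 = moving_average(cycle, 3)
print(ma3[:5], max(ma3) - min(ma3))

# With linear trend, the five-item average rises by exactly 3 at each step.
ma5_trended = moving_average(trended, 5)
print([b - a for a, b in zip(ma5_trended, ma5_trended[1:])])
```

The original cycle ranges over 8 units (from 2 to 10); the three-item average ranges over less, though its period is unchanged.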

                              Table 61 

  Illustrating the Application of Moving Averages to a Series with 
                           Linear Trend 

     (1)          (2)              (3)              (4)              (5) 
   Cyclical  Moving average   Moving average   Moving average   Moving average 
     data      of 5 items      of 10 items       of 3 items       of 8 items 
                               (centered)                        (centered) 

      2 
      9                                           8 1/3 
     14        12 1/5                            14 
     19        15 1/5                            16 2/3 
     17        18 1/5                            17 2/3           18 3/8 
     17        21 1/5           21 1/5           19 1/3           21 13/16 
     24        24 1/5           24 1/5           23 1/3           24 3/8 
     29        27 1/5           27 1/5           29               26 3/4 
     34        30 1/5           30 1/5           31 2/3           29 11/16 
     32        33 1/5           33 1/5           32 2/3           33 3/8 
     32        36 1/5           36 1/5           34 1/3           36 13/16 
     39        39 1/5           39 1/5           38 1/3           39 3/8 
     44        42 1/5           42 1/5           44               41 3/4 
     49        45 1/5           45 1/5           46 2/3           44 11/16 
     47        48 1/5           48 1/5           47 2/3           48 3/8 
     47        51 1/5                            49 1/3           51 13/16 
     54        54 1/5                            53 1/3 
     59        57 1/5                            59 
     64                                          61 2/3 
     62 

(The items in columns (3) and (5) have been centered by means of a 
moving average of 2 items.) 



The trend values, with the effect of the cycles completely 
removed, are secured by taking moving averages equal in 
period to the cycle or to a multiple of that period. The 
cycle persists with the same period but with diminished 
amplitude when the average is based upon a period not 
equal to that of the cycle, as is clear from the figures in 
columns (4) and (5). 

When these ideally simple conditions of constant period 
and amplitude do not exist, the moving average is 
more ambiguous and its interpretation less simple. If the 
period of the cycle varies, the selection of a period for the 
moving average is more difficult. In general, a period 
... average length of the cycle is to be selected. An 
average having a shorter period will give a line that is 
marked by pronounced cycles, these cycles being reduced as 
the period covered in the calculation of the average increases. 

... length of the cycle will give a line of trend marked by 
minor cycles. The amplitude of these secondary cycles 
will be a minimum when the period of the average is equal 
... 

When these last two irregularities are combined and the 
... length, the moving average giving the most ... the 
trend departs from linearity. ... underlying trend of a 
series is concave upward ... If the reverse is true, and the 
trend is convex upward, a moving average will always be less 
than the actual trend. ... the following examples. 

The figures in Table 62 show the results obtained when a 
cycle of constant period and amplitude, as in col. (3), is 
superimposed upon a line of trend that is concave upward, 
i.e., increasing at a constantly increasing rate. If the mov- 
ing average could completely eliminate the effects of the 
cycle, the values secured from the average would be equal to 
the average value of the five items in each cycle (6.2) plus 
the values of the function y = x², given in col. (2). 

                              Table 62 

       Illustrating the Application of Moving Averages to a 
                        Non-Linear Series 
                        (Increasing rate) 

   (1)     (2)       (3)           (4)             (5)             (6) 
    x       x²     Cyclical    Col. (2) plus  Moving average   True trend 
                     data        col. (3)       of 5 items       values 
                                                               (x² + 6.2) 

    0        0        2             2 
    1        1        6             7 
    2        4        8            12             12.2           10.2 
    3        9       10            19             17.2           15.2 
    4       16        5            21             24.2           22.2 
    5       25        2            27             33.2           31.2 
    6       36        6            42             44.2           42.2 
    7       49        8            57             57.2           55.2 
    8       64       10            74             72.2           70.2 
    9       81        5            86             89.2           87.2 
   10      100        2           102            108.2          106.2 
   11      121        6           127            129.2          127.2 
   12      144        8           152            152.2          150.2 
   13      169       10           179            177.2          175.2 
   14      196        5           201            204.2          202.2 
   15      225        2           227            233.2          231.2 
   16      256        6           262            264.2          262.2 
   17      289        8           297            297.2          295.2 
   18      324       10           334 
   19      361        5           366 

The values of the moving average are, in this case, in 
excess of the true trend values, a form of distortion that 
will always occur with a series of this type. 
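
The size of this distortion can be verified directly. The sketch below rebuilds the series of Table 62 (the cycle 2, 6, 8, 10, 5 added to x²) and compares the five-item moving average with the true trend x² + 6.2; the variable names are illustrative.

```python
cycle = [2, 6, 8, 10, 5] * 4
series = [x * x + c for x, c in zip(range(20), cycle)]   # col. (2) plus col. (3)

period = 5
moving_avg = [sum(series[i:i + period]) / period
              for i in range(len(series) - period + 1)]   # centered on x = 2, ..., 17

true_trend = [x * x + 6.2 for x in range(2, 18)]

excess = [round(m - t, 6) for m, t in zip(moving_avg, true_trend)]
print(excess)
```

The excess is the same at every point because the average of (x-2)², (x-1)², x², (x+1)², (x+2)² is x² + 2 whatever the value of x, while the cycle itself averages out to 6.2.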

In Table 63 are shown the results of superimposing the 
same cyclical values upon a line of trend that is convex 
upward, i.e., increasing at a constantly decreasing rate. 
In this case, a perfect method of eliminating the cycles 
would give results equal to the average value of the 
items (6.2) plus the values of the function y = √x. 
The moving average values are consistently 
too low. The discrepancy is greatest for the lower values 
... 


                              Table 63 

       Illustrating the Application of Moving Averages to a 
                        Non-Linear Series 
                        (Decreasing rate) 

   (1)     (2)       (3)           (4)             (5)             (6) 
    x       √x     Cyclical    Col. (2) plus  Moving average   True trend 
                     data        col. (3)       of 5 items       values 
                                                               (√x + 6.2) 

    0      0          2            2.00 
    1      1.00       6            7.00 
    2      1.41       8            9.41           7.428           7.61 
    3      1.73      10           11.73           7.876           7.93 
    4      2.00       5            7.00           8.166           8.20 
    5      2.24       2            4.24           8.414           8.44 
    6      2.45       6            8.45           8.634           8.65 
    7      2.65       8           10.65           8.834           8.85 
    8      2.83      10           12.83           9.018           9.03 
    9      3.00       5            8.00           9.192           9.20 
   10      3.16       2            5.16           9.354           9.36 
   11      3.32       6            9.32           9.510           9.52 
   12      3.46       8           11.46           9.658           9.66 
   13      3.61      10           13.61           9.800           9.81 
   14      3.74       5            8.74           9.936           9.94 
   15      3.87       2            5.87          10.068          10.07 
   16      4.00       6           10.00          10.194          10.20 
   17      4.12       8           12.12          10.318          10.32 
   18      4.24      10           14.24 
   19      4.36       5            9.36 

The examples reviewed have indicated that a moving 
average should, in general, be based upon a period ... 
equal to some higher multiple of that period when the 
data are at all irregular. The longer the period covered, 
the greater the stability of the average. But when the 
underlying trend departs materially from the linear form, 
following a curve bending upward or downward, the error 
involved in the use of any moving average increases as 
the period of the average increases. If a moving average 
is used in such a case to measure the trend, the period of 
the average should be the shortest which will serve to 
average out the cycles; equal, that is, to the average length 
of one cycle. 
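
How the error grows with the period of the average may be illustrated on a trend free of cycles, so that only the effect of curvature appears. This is a sketch under that simplifying assumption; the function name `centered_error`, the point x = 20, and the periods 5 and 15 are illustrative choices.

```python
import math

def centered_error(trend, x, period):
    """Moving average of `trend` centered on x, minus the true value trend(x)."""
    half = period // 2
    window = [trend(x + k) for k in range(-half, half + 1)]
    return sum(window) / period - trend(x)

def square(x):
    return x * x

for period in (5, 15):
    print(period,
          round(centered_error(square, 20, period), 3),      # concave upward
          round(centered_error(math.sqrt, 20, period), 4))   # convex upward
```

For the trend bending upward the average overstates the trend, for the trend bending downward it understates it, and in both cases the longer period gives the larger error.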

In practice, however, these various conditions are found 
in complicated combinations. The fact that cycles vary in 
amplitude and length calls for a moving average based 
upon a fairly long period. The fact that the trend of the 
data is usually non-linear calls for a short period average 
to lessen the upward or downward distortion. A considera- 
tion of some importance in practical work is that a moving 
average can never be brought up to date. The lag is less, 
of course, the shorter the period covered by the average. 
The selection of a period in a given case must rest upon a 
study of the actual data with these various considerations 
in mind. 

It has been assumed in the preceding discussion that the 
purpose of the moving average is the representation of 
secular trend. The moving average may be used, also, in 
smoothing data for the purpose of eliminating random 
fluctuations. For this purpose a moving average based 
upon a period shorter than the average length of the cycle 
should be selected. 

We may return now to the problem relating to New York 
bank clearings. A study of the lines marked out by the 
different moving averages in Fig. 54 reveals significant 
differences between them. The three-year average follows 
the graph of the original data most closely, as would be 
expected. The nine-year average marks out the smoothest 
line of trend, but, on the other hand, departs most widely 
from the data. This is particularly noticeable from 1893 to 
1898, from 1911 to 1916, from 1921 to 1926, and from 1927 
to 1931. It is due to the pronounced changes in the rate 
of growth of the series during these periods. Except for 
these distortions the general trend seems to be most accu- 
rately represented by the nine-year average. 

In determining the relative merits of the different moving 
averages we are aided by a knowledge of the course of 
business during the period covered. The volume of New 
York bank clearings is a sensitive index of general business 
conditions, responding immediately to changes in specula- 
tive and industrial activity. Major and minor business 
cycles are reflected in this series. Knowing the number of 
cycles through which business has passed during the period 
1875-1936, we may determine which of the moving averages 
serves best as a standard from which to measure cyclical 
deviations. In this case we are practically working back- 
ward from a known result, a method not always available. 
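
Measuring cyclical deviations from a moving average taken as standard amounts to subtracting the trend value from the original figure for each year. A minimal sketch, using the 1912-1936 clearings of Table 59 and a five-year average (the choice of period and the variable names are illustrative):

```python
years = list(range(1912, 1937))
clearings = [96.7, 98.1, 89.8, 90.8, 147.2, 181.5, 174.5, 214.7, 252.3,
             204.1, 213.3, 214.6, 235.5, 276.9, 293.4, 307.2, 368.9, 456.9,
             399.5, 287.7, 177.3, 154.6, 162.7, 174.4, 186.5]

period = 5
half = period // 2

# Deviation of each year's figure from the centered five-year average.
deviations = {}
for i in range(half, len(clearings) - half):
    trend = sum(clearings[i - half:i + half + 1]) / period
    deviations[years[i]] = clearings[i] - trend

for year, deviation in deviations.items():
    print(year, f"{deviation:+.2f}")
```

Runs of positive and negative deviations mark out the swings above and below the standard; no deviations can be had for the first two and the last two years, for which the average does not exist.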

If we take as a starting point in each cycle the year in 
which revival began, after recession, the following cycles 
in general business activity may be distinguished:^ 

              1871-1879        1908-1912 
              1879-1886        1912-1914 
              ...              1915-1919 
              1888-1891        1919-1921 
              1891-1894        1921-1924 
              1894-1897        1924-1927 
              1897-1900        1927-1933 
              1901-1904        1933- 
              1904-1908 

^ ... chronology of American business cycles ... Production during the 
American Business Cycle of ..., ... Arthur F. Burns, ... National Bureau of 
Economic Research, November 9, 1936. It should ... that ... Clearing 
House data cited ... 

The cycles marked out by the three-year moving average 
are too numerous to enumerate. In fact, the deviations 
from this average are primarily accidental and minor 
fluctuations and should not be classed as cycles. Deviations 
from the five-, seven-, and nine-year averages mark out the 
following cycles: 
Cycles of deviations   Cycles of deviations   Cycles of deviations 
from five-year         from seven-year        from nine-year 
moving averages        moving averages        moving averages 

1879-1885              1879-1885              1879-1885 
1885-1888              1885-1888              1885-1888 
1888-1891              1888-1894              1888-1897 
1891-1897              1894-1900              1897-1900 
1897-1900              1900-1904              1900-1904 
1900-1904              1904-1908              1904-1908 
1904-1908              1908-1911              1908-1916 
1908-1911              1911-1915              1915-1923 
1911-1915              1915-1918              1923- 
1915-1918              1918-1923 
1918-1924              1923-1927 
1924-1927              1927-1932 
1927-1932              1932- 
1932- 


Some of the differences between the series of cycles thus 
determined and the reference cycles distinguished by 
Mitchell are doubtless due to the distinctive behavior of 
New York clearings. Other differences reflect the peculiari- 
ties of moving averages. Deviations from the five-year 
averages between 1879 and 1927 show one more cycle than 
we find in the series based on seven-year averages, four 
more cycles than are shown by the nine-year averages. 
And yet the deviations from five-year averages fail to show 
the cycles of 1894-1897 and of 1921-1924. The nine-year 
averages reveal only eight cycles between 1879 and 1927, 
as against Mitchell’s fourteen reference cycles. Mitchell 
was working, of course, with monthly data which are 
more sensitive than annual data to cyclical forces. Moreover, 
he was dealing with relatively short movements, some 
of which appear as only minor fluctuations in general business 
activity. 

If interest attaches to the shorter swings of business, 
to cycles with average durations of three or four years, 
a moving average of relatively short period should be used. 
A five-year average is appropriate. Averages of longer 
period define trend movements more faithfully, but may 
fail to reveal fluctuations properly classified as business 
cycles. We should refer, however, to recent attempts to 
establish the reality of long cycles, of nine, eleven, or as 
many as thirty years in average duration. In the study 
of such cycles moving averages of corresponding periods 
would be employed. 
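
The effect of the period chosen may be seen in a small computation. The sketch below is offered only as an illustration: the function name `centered_moving_average` and the short annual series are hypothetical, and the series is not the New York clearings data. It computes centered moving averages of three, five, and nine years and the deviations of the data from each.

```python
def centered_moving_average(values, period):
    """Centered moving average; years lacking a full window are None.

    `period` must be odd so that each average is centered on a year.
    """
    if period < 1 or period % 2 == 0:
        raise ValueError("period must be a positive odd integer")
    half = period // 2
    result = [None] * len(values)
    for i in range(half, len(values) - half):
        window = values[i - half : i + half + 1]
        result[i] = sum(window) / period
    return result


# Hypothetical annual series, used only to show the mechanics.
series = [10, 12, 15, 13, 17, 20, 18, 22, 26, 24, 27, 31, 29]

for period in (3, 5, 9):
    trend = centered_moving_average(series, period)
    deviations = [
        None if t is None else round(y - t, 2) for y, t in zip(series, trend)
    ]
    print(period, deviations)
```

The longer the period, the more years are lost at each end of the series and the smoother the line from which deviations are measured.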

In general, the moving average has the prime advantage 
of flexibility. The representation of secular trend by mathematical 
curves frequently involves the breaking up of a 
period into two or three subdivisions, and the fitting of 
separate curves to each. This results from changing conditions 
and sharply changing rates of growth or decline. 
Where such changes occur the moving average has the 
merit of flexible adaptation to the new conditions and is 
often a more effective measure of secular trend than curves 
fitted with great labor. 

Simple and weighted moving averages, in varying combinations, 
have wide uses in the analysis of economic time 
series. An illuminating discussion of these uses, and of the 
types appropriate to different purposes, is to be found 
in The Smoothing of Time Series, by Frederick R. Macaulay.* 

Representation of Secular Trend by Mathematical Curves 

For many types of data the secular trend may be represented 
by a mathematical curve rather than by a line 
based upon a moving average. Thus, if the growth (or 
decline) is by constant absolute increments (or decrements) 
a straight line will serve as an exact representation of the 
trend. Or the growth may be by constant percentages, 
as in the case of capital increase, when a principal sum 
increases in accordance with the compound interest law. 

* National Bureau of Economic Research, New York, 1931. 





A curve of a definite mathematical form furnishes the best 
representation of this trend. In many series of economic 
statistics the data seem to conform to definite laws of 
growth, or decline, and where this is the case the task of 
analysis, interpretation, and projection is materially assisted 
by securing a mathematical expression for the underlying 
law. In practically all cases, of course, there are departures 
from this law, deviations above and below the line of secular 
trend. These deviations, however, do not destroy the value 
of an equation that describes the underlying law of develop- 
ment. 

There is one fundamental difference between the moving 
average as a measure of trend and such mathematical 
curves. The former implies no definite “law” to which 
the data are assumed to conform. It is based upon the 
data as given; if the general trend changes, the moving 
average follows the new trend. It is a flexible measure of 
trend, adapting itself to changing conditions, purporting 
to be nothing more than an empirical approximation to 
the drift of the series. Mathematical curves fitted to eco- 
nomic series are, in fact, nothing more than empirical 
approximations also, but in a somewhat different sense. 
They assume a “law” of change underlying the variations, 
accidental and otherwise, which show upon the surface of 
the data. It is an empirical law which is assumed, it is 
true, but nevertheless there is postulated a uniform and 
consistent trend capable of mathematical expression. If 
such an assumption is to have any validity it is essential 
that the period during which the law is supposed to hold 
be homogeneous, that there be no material changes in 
the conditions affecting the series being studied. Thus 
an equation is secured for the trend of gold production, 
let us say. If a radical change should take place in methods 
of extraction the trend of gold production would change 
materially and the former equation would no longer apply. 
Data covering the period before and after such a change 
would not be homogeneous, and a single equation for the 
trend during the whole period should not be secured. 

In the practical approach to a problem involving the 
determination of secular trend the first task is the selection 
of the appropriate type of curve. This is perhaps the most 
difficult part of the work; certainly it is the part in which 
the element of personal judgment enters most directly. 
For there is no objective rule to follow, no fixed standard 
by which the most appropriate curve may be selected. 
Something more will be said on this subject after the 
characteristics of the chief types of curves and the methods 
of fitting them have been described. For the present it 
may be assumed that a curve similar to one of the types 
described in Chapter II, or to a related form, has been 
selected, and that we face the practical task of fitting it to 
the data. 

FITTING A STRAIGHT LINE; THE METHOD OF LEAST SQUARES 

If the data, when plotted, show a trend that can best 
be represented by a straight line the task of fitting is 
merely the determination of the constants in an equation 
of the form y = a + bx. The values of a and b which 
will give a line following most closely the trend of the 
data are to be obtained. A simple illustration may serve 
to demonstrate the various methods which may be employed. 
Nine points (1, 3; 2, 4; 3, 6; 4, 5; 5, 10; 6, 9; 7, 10; 8, 12; 
9, 11) are plotted in Fig. 55. Our problem is the fitting of 
a straight line to these points. 

By inspection approximate values of a and b may be 
determined. A thread may be stretched through the 
points in such a direction that it seems to follow the trend 
as closely as possible. The slope of the line thus laid out 
may be measured, the y-intercept determined, and the 
desired equation thus approximated. Obviously this is a 
loose and uncertain method, and the results obtained by 
different individuals may be expected to vary rather widely. 




There is one and only one straight line that fits the plotted 
data most accurately. The constants for this line of best 
fit may be determined by the method of least squares. 

Fig. 55. Illustrating the Fitting of a Straight Line to Nine Points 

The theory upon which the method of least squares is 
based need not be detailed at length here. The argument 
may be briefly presented: A number of observation values 
of a certain quantity are found, and it is desired to obtain 
the most probable value of the quantity which is being 
measured. It is capable of demonstration that the most 
probable value of the quantity is that value for which the 
sum of the squares of the residuals is a minimum. (The 
“residual” is a term for the difference between a given 
estimated value and an observation value.) This is true 
of the arithmetic mean of the observation values. Thus, 
if a given distance be measured by a number of individuals, 
with varying results, the most probable value is the arith- 
metic mean of the different measurements. The process 
of computing the mean involves the following steps, which 
are enumerated for the purpose of simplifying the later 
explanation. We seek a result, a statement of the most 
probable value of the distance being measured, which will 
take the form: 

M = (a constant). 

Let us say we have three approximations to this value: 

        M = 5,672 feet 
        M = 5,671 feet 
        M = 5,676 feet 
adding, 3M = 17,019 feet. 

Since there is but one unknown, M, it may be derived 
directly from this equation, and we have 

M = 5,673 feet. 

This is the value for which the sum of the squares of the 
deviations is a minimum. 
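
The minimum property of the mean may be checked directly on the three measurements given above. The following is a minimal sketch; the helper name `sum_of_squares` is introduced here only for illustration.

```python
# The three measurements of the distance quoted in the text, in feet.
measurements = [5672, 5671, 5676]


def sum_of_squares(value):
    """Sum of squared deviations of the measurements from `value`."""
    return sum((m - value) ** 2 for m in measurements)


mean = sum(measurements) / len(measurements)

# Compare the mean with a few neighbouring trial values.
for trial in (5671, 5672, mean, 5674, 5675):
    print(trial, sum_of_squares(trial))
```

Any trial value other than the mean gives a larger sum of squared deviations.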

A similar problem arises when the relation between two 
variables is being measured. Our goal in this case is the 
equation that correctly describes this relationship. We 
have secured, however, varying results which do not agree 
precisely as to the constants in the equation of relationship. 
In other words, our plotted points do not all lie on the 
same line. What are the most probable values of the constants 
in the required equation? The answer is analogous 
to that given when a single quantity was being measured. 
We seek the constants which, when the resulting equation 
is plotted, will give a line from which the deviations of 
the separate points, when squared and totaled, will be a 
minimum. Assuming that each pair of measurements gives 
an approximation to the true relationship between the 
variables, we wish to find the most probable relationship, 
and this is given by the line for which the sum of the squared 
deviations is a minimum.¹ 

¹ Cf. Appendix A for a more detailed discussion of the method of least squares, 
together with a description of certain checks upon the calculations. 

We have, in the present example, nine pairs of values for 
x and y. Substituting these values in the generalized form 
of the linear equation, y = a + bx, we secure the following 
observation equations: 

 3 = a + 1b 
 4 = a + 2b 
 6 = a + 3b 
 5 = a + 4b 
10 = a + 5b 
 9 = a + 6b 
10 = a + 7b 
12 = a + 8b 
11 = a + 9b. 

Any two of these equations could be solved as simultaneous 
equations, and values of a and b secured. But these values 
would not satisfy the remaining equations. Our problem 
is to combine the nine observation equations so as to secure 
two normal equations, which, when solved simultaneously, 
will give the most probable values of a and b. The first 
of these normal equations is secured by multiplying each of 
the observation equations by the coefficient of the first 
unknown (a) in that equation, and adding the equations 
obtained in this way. Since the coefficient of a in the present 
case is 1 throughout, the nine observation equations are 
unchanged by the process of multiplication. The second 
of the normal equations is secured by multiplying each 
of the observation equations by the coefficient of the second 
unknown (b) in that equation, and adding the equations 
obtained. Thus the first equation is multiplied throughout 
by 1, the second by 2, and so on. The process of securing 
the two normal equations is illustrated in Table 64 on 
page 250. 

Table 64 

Derivation of Normal Equations from Observation Equations 

 3 = a + 1b          3 = 1a +  1b 
 4 = a + 2b          8 = 2a +  4b 
 6 = a + 3b         18 = 3a +  9b 
 5 = a + 4b         20 = 4a + 16b 
10 = a + 5b         50 = 5a + 25b 
 9 = a + 6b         54 = 6a + 36b 
10 = a + 7b         70 = 7a + 49b 
12 = a + 8b         96 = 8a + 64b 
11 = a + 9b         99 = 9a + 81b 

70 = 9a + 45b      418 = 45a + 285b 

The two normal equations are 

 70 = 9a + 45b 
418 = 45a + 285b. 

It remains to solve these equations for a and b. By multiplying 
the first equation by 5 and subtracting it from the 
second, a may be eliminated; a value of 68/60, or 1.133, is 
found for b. Substituting this value in either of the equations, 
a value of 2.111 is secured for a. The equation to 
the best fitting straight line is, therefore, 

y = 2.111 + 1.133x. 
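
The formation and solution of the two normal equations may be sketched in a few lines. This is an illustration only; the function name `fit_line` is hypothetical, and exact fractions are used so that the result is not affected by rounding.

```python
from fractions import Fraction

# The nine points plotted in Fig. 55.
points = [(1, 3), (2, 4), (3, 6), (4, 5), (5, 10),
          (6, 9), (7, 10), (8, 12), (9, 11)]


def fit_line(points):
    """Solve the normal equations of y = a + bx and return (a, b)."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    # Eliminate a from  sum_y = n*a + b*sum_x  and  sum_xy = a*sum_x + b*sum_x2.
    b = Fraction(n * sum_xy - sum_x * sum_y, n * sum_x2 - sum_x * sum_x)
    a = (sum_y - b * sum_x) / n
    return a, b


a, b = fit_line(points)
print(a, b)                                  # exact fractions
print(round(float(a), 3), round(float(b), 3))
```

The exact values are a = 19/9 and b = 17/15, which round to the 2.111 and 1.133 given above.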

In the actual application of the method it is not necessary 
to write out and total the equations, as is done above. 
We need only insert the proper values in the two equations,¹ 

Σ(y) = na + bΣ(x) 
Σ(xy) = aΣ(x) + bΣ(x²). 

The symbols employed have the following meanings: 

Σ(y): the sum of the values of y. 
Σ(x): the sum of the values of x. 
Σ(xy): the sum of the products of the paired x's and y's. 
Σ(x²): the sum of the squares of the values of x. 
n: the number of pairs of values; the number of points 
plotted. 

The work of computation is facilitated by a tabular 
arrangement similar to that shown in Table 65. 

The two desired normal equations are secured by sub- 
stituting these five values in the type equations given 
above. It will be noted that the results are identical with 
those obtained from the observation equations. 

¹ General rules for the formation of normal equations are given in 
Appendix A. 

Table 65 

Computation of Values in Fitting a Straight Line 

  x      y      xy     x² 

  1      3       3      1 
  2      4       8      4 
  3      6      18      9 
  4      5      20     16 
  5     10      50     25 
  6      9      54     36 
  7     10      70     49 
  8     12      96     64 
  9     11      99     81 

 45     70     418    285 


n = 9 
Σ(x) = 45 
Σ(y) = 70 
Σ(x²) = 285 
Σ(xy) = 418 

When the equation to the best fitting straight line has 
been obtained, the values of y corresponding to the given x 
values may be computed. Table 66 presents the results secured: 

Table 66 

Comparison of Observed and Computed Values of a Variable Quantity 

  x     y        y            d           xd          d² 
             (computed) 

  1     3      3.2444     -  .2444    -   .2444      .0597 
  2     4      4.3778     -  .3778    -   .7556      .1427 
  3     6      5.5111     +  .4889    +  1.4667      .2390 
  4     5      6.6444     - 1.6444    -  6.5778     2.7041 
  5    10      7.7778     + 2.2222    + 11.1111     4.9381 
  6     9      8.9111     +  .0889    +   .5333      .0079 
  7    10     10.0444     -  .0444    -   .3111      .0020 
  8    12     11.1778     +  .8222    +  6.5778      .6760 
  9    11     12.3111     - 1.3111    - 11.8000     1.7190 

                              0            0       10.4885 

The sum of the deviations is zero, and the sum of the 
deviations when each is multiplied by the corresponding 
value of x is also zero. The accuracy of the actual 
calculations involved in fitting may 
be tested in this way. The sum of the squares of the deviations, 
10.4885, is a minimum. Any change in the value of 
a or b would give a line from which the sum of the squared 
deviations would exceed 10.4885. 
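
These checks may be reproduced as follows. The sketch assumes the exact solutions of the normal equations 70 = 9a + 45b and 418 = 45a + 285b, namely a = 19/9 and b = 17/15; the function name `sum_squared_deviations` and the size of the trial change in the constants are illustrative choices. Carried through in exact fractions the sum of squares is 472/45, or about 10.4889; the small difference from the tabulated 10.4885 appears to arise from rounding in the individual entries.

```python
from fractions import Fraction

points = [(1, 3), (2, 4), (3, 6), (4, 5), (5, 10),
          (6, 9), (7, 10), (8, 12), (9, 11)]
a, b = Fraction(19, 9), Fraction(17, 15)   # exact least-squares constants


def sum_squared_deviations(a, b):
    """Sum of squared deviations of the points from the line y = a + bx."""
    return sum((y - (a + b * x)) ** 2 for x, y in points)


deviations = [y - (a + b * x) for x, y in points]
sum_d = sum(deviations)
sum_xd = sum(x * d for (x, _), d in zip(points, deviations))
best = sum_squared_deviations(a, b)
print(sum_d, sum_xd, float(best))

# A small change in either constant, in either direction, raises the sum.
step = Fraction(1, 100)
for da, db in [(step, 0), (-step, 0), (0, step), (0, -step)]:
    assert sum_squared_deviations(a + da, b + db) > best
```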

FITTING A STRAIGHT LINE; SPECIAL CASES 

The simultaneous solution of the two normal equations 
will give, in any case, the most probable values of a and b. 
The processes of calculation may be simplified in certain 
special cases, not infrequently encountered in handling 
economic data. If the x's are consecutive numbers, as they 
always are when an unbroken time series is plotted, the 
origin may be taken at the median value. When the number 
of observations is odd this will be the middle item, of 
course. The value of Σ(x) will then be zero, and the normal 
equations become 

Σ(y) = na 
Σ(xy) = bΣ(x²). 

Thus if a time series extends, by years, from 1901 to 1937, 
the origin may be taken at 1919, the value of x corresponding 
to 1918 being -1, to 1920, +1, and so on. The solution 
for values of a and b is rendered much easier when the 
data may be disposed in this way. When there is an even 
number of years the same process is possible, time (the 
x-variable) being measured in units of one half year. 

When the values of x are consecutive positive 
numbers starting at 1, the values of Σ(x) and of Σ(x²) 
may be easily determined. The sum of the first n natural 
numbers is equal to n(n + 1)/2. Thus the sum of the numbers 
from 1 to 5 is (5 × 6)/2, or 15. This term may replace Σ(x) 
in the normal equations. Similarly, the sum of the squares 
of the first n natural numbers is equal to n(n + 1)(2n + 1)/6. 
Thus the sum of the squares of the numbers from 1 to 5 
is equal to (5 × 6 × 11)/6 = 55. This expression may replace 
Σ(x²) in the normal equations, and we have 

Σ(y) = na + b·n(n + 1)/2 
Σ(xy) = a·n(n + 1)/2 + b·n(n + 1)(2n + 1)/6. 

It is sometimes easier to work from equations in this form 
than in the form first given. The data for time series may 
be handled in this way, the years being numbered consecu- 
tively, beginning with 1. 
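
Both simplifications may be tried on the nine points used earlier. The sketch below is illustrative only, and the helper names are hypothetical. It first moves the origin to the middle value of x, so that the two normal equations separate, and then checks the two closed expressions for the sums.

```python
points = [(1, 3), (2, 4), (3, 6), (4, 5), (5, 10),
          (6, 9), (7, 10), (8, 12), (9, 11)]
n = len(points)
middle = (n + 1) // 2                      # middle value of x when n is odd

# Origin at the middle item: the sum of the shifted x values is zero.
shifted = [(x - middle, y) for x, y in points]
sum_x = sum(x for x, _ in shifted)
a_at_middle = sum(y for _, y in shifted) / n
b = sum(x * y for x, y in shifted) / sum(x * x for x, _ in shifted)
a = a_at_middle - b * middle               # constant referred to the old origin


def sum_first_n(n):
    """1 + 2 + ... + n."""
    return n * (n + 1) // 2


def sum_squares_first_n(n):
    """1 + 4 + ... + n squared."""
    return n * (n + 1) * (2 * n + 1) // 6


print(sum_x, round(a, 3), round(b, 3))
print(sum_first_n(5), sum_squares_first_n(5))
```

The slope is unchanged by the shift of origin; only the constant a depends upon where the origin is placed.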

FITTING A CURVE OF THE POWER SERIES 

The discussion above has been confined to the case of 
linear trend. Such a type of curve frequently gives an 
excellent fit, but in many cases it fails accurately to fit the 
data. This difficulty is sometimes overcome in practice 
by breaking a series into segments and fitting a separate 
line to the data for each of these periods. Where there is 
an actual break in the series, the period as a whole lacking 
homogeneity, this practice may be justified, but when the 
period is essentially homogeneous the whole concept of 
secular trend is violated by this process of subdividing 
and fitting separate lines. In many cases where a straight 
line will not fit, a curve of the power series may represent 
the trend accurately. The general process of fitting such a 
curve may be briefly described. 

The generalized form of the equation of the type desired 
is y = a + bx + cx² + dx³ + ... . An equation of this 
form does not, of course, represent a curve of the parabolic 
type, but in ordinary usage that term is applied to the 
potential series. If carried to the second power of x it is 
called a second degree parabola; if to the third power, a third 
degree parabola, etc. For ordinary purposes such a curve 
should not be carried beyond the second or third power of x. 

In fitting a second degree parabola three constants must be 
determined, and three normal equations must be solved 
simultaneously in securing them. Each observation equation 
is multiplied by the coefficient of the first unknown in that 
equation, and the resulting equations are added to secure the 
first normal equation. The process is repeated for the other 
unknowns, and the three normal equations are then solved 
for a, b, and c to secure the most probable values of these 
three constants. The following are the general forms which 
the three normal equations take: 

Σ(y) = na + bΣ(x) + cΣ(x²) 
Σ(xy) = aΣ(x) + bΣ(x²) + cΣ(x³) 
Σ(x²y) = aΣ(x²) + bΣ(x³) + cΣ(x⁴). 

The calculations involved in fitting a second degree parabola 
to nine points (1, 2; 2, 6; 3, 7; 4, 8; 5, 10; 6, 11; 7, 11; 
8, 10; 9, 9) are outlined below. It is of the greatest practical 
importance in curve fitting, as in all extensive calculations, 
that the work be laid out and carried on in a definite and 
systematic fashion, with each step definitely related to the 
preceding and succeeding operations. Checks should be 
applied as far as possible in the course of the work. A tabular 
arrangement is generally advisable, each operation being 
revealed and the results clearly presented. The data in the 
present problem may be arranged as in Table 67. 

When the x's are consecutive numbers beginning with 1, 
as in the present case, the values of Σ(x), Σ(x²), Σ(x³), 
and Σ(x⁴) may be secured from prepared tables.¹ 

¹ Cf. Table XXVIII ... and Appendix Table IX of the present volume. 

Table 67 

Computation of Values Required in Fitting a Second Degree Parabola 

  x      y      xy     x²     x²y 

  1      2       2      1       2 
  2      6      12      4      24 
  3      7      21      9      63 
  4      8      32     16     128 
  5     10      50     25     250 
  6     11      66     36     396 
  7     11      77     49     539 
  8     10      80     64     640 
  9      9      81     81     729 

 45     74     421    285   2,771 

n = 9 
Σ(x) = 45 
Σ(x²) = 285 
Σ(x³) = 2,025 
Σ(x⁴) = 15,333 
Σ(y) = 74 
Σ(xy) = 421 
Σ(x²y) = 2,771 


Substituting these values in the equations given above, 
the following normal equations are secured : 

   74 = 9a + 45b + 285c 
  421 = 45a + 285b + 2,025c 
2,771 = 285a + 2,025b + 15,333c. 

When these equations are solved simultaneously the 
following values are secured for the three constants: 

a = - .929 
b = + 3.523 
c = - .267. 

The equation of the desired curve is 

y = - .929 + 3.523x - .267x². 

This curve and the nine given points are plotted in Fig. 56 
on page 256. 
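
The same result may be obtained by forming the three normal equations from the nine points and solving them by elimination. The sketch below is one way of doing so and is not the tabular procedure of the text; the names `solve` and `fit_parabola` are hypothetical, and exact fractions are used to avoid rounding error.

```python
from fractions import Fraction

# The nine points of Table 67.
points = [(1, 2), (2, 6), (3, 7), (4, 8), (5, 10),
          (6, 11), (7, 11), (8, 10), (9, 9)]


def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination."""
    size = len(rhs)
    aug = [[Fraction(v) for v in row] + [Fraction(r)]
           for row, r in zip(matrix, rhs)]
    for col in range(size):
        # Bring a row with a non-zero entry in this column into place.
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def fit_parabola(points):
    """Return (a, b, c) of y = a + bx + cx^2 from the normal equations."""
    def sx(k):
        return sum(x ** k for x, _ in points)           # sum of x^k

    def sxy(k):
        return sum(x ** k * y for x, y in points)       # sum of x^k * y

    matrix = [[sx(0), sx(1), sx(2)],
              [sx(1), sx(2), sx(3)],
              [sx(2), sx(3), sx(4)]]
    return solve(matrix, [sxy(0), sxy(1), sxy(2)])


a, b, c = fit_parabola(points)
print(round(float(a), 3), round(float(b), 3), round(float(c), 3))
```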

If the values of x are consecutive, as in the present 
example, the work of computation is lightened if the mid-value 
is taken as origin. In this case Σ(x) and Σ(x³) are 
equal to zero, and the normal equations become 

Σ(y) = na + cΣ(x²) 
Σ(xy) = bΣ(x²) 
Σ(x²y) = aΣ(x²) + cΣ(x⁴). 

When a third degree parabola of the form y = a + bx 
+ cx² + dx³ is to be fitted to data, four constants must be 
determined, and four normal equations are necessary. 
These are of the following form: 

Σ(y) = na + bΣ(x) + cΣ(x²) + dΣ(x³) 
Σ(xy) = aΣ(x) + bΣ(x²) + cΣ(x³) + dΣ(x⁴) 
Σ(x²y) = aΣ(x²) + bΣ(x³) + cΣ(x⁴) + dΣ(x⁵) 
Σ(x³y) = aΣ(x³) + bΣ(x⁴) + cΣ(x⁵) + dΣ(x⁶). 


Fig. 56. Illustrating the Fitting of a Second Degree Curve to Nine Points 

The solution for four or more constants involves a considerable 
amount of arithmetical calculation, and there is 
some question as to the advisability of representing secular 
trend by equations of this type. With a sufficient number 
of constants a curve may be fitted which will follow every 
variation in the data, but such a curve could hardly be taken 
to represent the long time trend.¹ Minor departures from 
a simple and uniform trend, linear or otherwise, are to be 
expected with economic data, but, if a real trend exists, 
extreme departures from a fairly simple form are rare. If 
such departures are due to pronounced changes in conditions 
no single line of trend is likely to be satisfactory, and it is 
advisable to break the period into parts, with a separate 
line of trend for each part. “Empirical curves,” says 
Steinmetz, “can be represented by a single equation only 
when the physical conditions remain constant within the 
range of the observations.” Though this statement relates 
to the fitting of curves to data from the physical sciences, 
it applies equally well to economic data. 

¹ Regarding the employment of potential series of the type indicated for 
representing empirical curves, Steinmetz states that their use is justified: 

1. If the successive coefficients a, b, c . . . decrease in value so rapidly that 
within the range of observation the higher terms become rapidly smaller 
and appear as mere secondary terms. 

2. If the successive coefficients follow a definite law, indicating a convergent 
series which represents some other function, as an exponential, trigonometric, 
etc. 

3. If all the coefficients are very small, with the exception of a few of them, and 
only the latter ones thus need to be considered. Cf. C. P. Steinmetz, 
Engineering Mathematics, New York, McGraw-Hill, 1917, 214-215. 

Determination of the Secular Trend of a 
Business Series 

FITTING A STRAIGHT LINE 

The procedure of fitting certain types of curves to simple 
data has been illustrated in the preceding sections. Before 
proceeding to a discussion of slightly different forms, it 
will be helpful to examine concrete examples of trend 
determination. We first determine the secular trend of 
a series defining the number of concerns in business in the 
United States, during the period from 1899 to 1914.² The 
observations are given in Table 68, together with the values 
required for the fitting of a straight line to the data, and 
the derived trend values. The values of x represent the 
time factor, while the values of y are the corresponding 
numbers of business concerns. Only the entries in columns 
(2) to (5), it should be noted, are employed in the fitting 
process. 

² Data compiled by Dun and Bradstreet. 

Table 68 

Number of Concerns in Business in the United States, 1899-1914 

Computation of values required in fitting line of trend 

 (1)      (2)      (3)        (4)       (5)      (6) 
 Year      x        y         xy        x²       yc 

Column (3): No. of concerns in business, in thousands. 
Column (6): Trend values (linear) of no. of concerns in business, in thousands. 

 1899      1      1,148      1,148       1      1,152 
 1900      2      1,174      2,348       4      1,184 
 1901      3      1,219      3,657       9      1,217 
 1902      4      1,253      5,012      16      1,250 
 1903      5      1,281      6,405      25      1,283 
 1904      6      1,320      7,920      36      1,316 
 1905      7      1,357      9,499      49      1,349 
 1906      8      1,393     11,144      64      1,382 
 1907      9      1,418     12,762      81      1,415 
 1908     10      1,448     14,480     100      1,448 
 1909     11      1,486     16,346     121      1,481 
 1910     12      1,515     18,180     144      1,513 
 1911     13      1,525     19,825     169      1,546 
 1912     14      1,564     21,896     196      1,579 
 1913     15      1,617     24,255     225      1,612 
 1914     16      1,655     26,480     256      1,645 

 Totals  136     22,373    201,357   1,496 

N = 16 
Σ(x) = 136 
Σ(x²) = 1,496 
Σ(y) = 22,373 
Σ(xy) = 201,357 


The equations to be solved in determining the required 
constants are of the form 


Σ(y) = Na + bΣ(x) 
Σ(xy) = aΣ(x) + bΣ(x²). 

Inserting the given values in the formulas, we have 

 22,373 = 16a + 136b 
201,357 = 136a + 1,496b 

from which 

a = 1,118.65 
b = 32.90. 

The equation to the line of best fit is therefore 

y = 1,118.65 + 32.90x 

with origin at 1898. 
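
The constants may be verified from the observations of Table 68. The following is a minimal sketch, with variable names chosen only for illustration; x is numbered from 1 for 1899, so that the origin falls at 1898.

```python
# Number of concerns in business, in thousands, 1899-1914 (Table 68).
concerns = [1148, 1174, 1219, 1253, 1281, 1320, 1357, 1393,
            1418, 1448, 1486, 1515, 1525, 1564, 1617, 1655]
xs = list(range(1, len(concerns) + 1))      # 1899 -> 1, ..., 1914 -> 16

n = len(concerns)
sum_x = sum(xs)
sum_y = sum(concerns)
sum_xy = sum(x * y for x, y in zip(xs, concerns))
sum_x2 = sum(x * x for x in xs)

# Solution of the two normal equations.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

trend = [a + b * x for x in xs]             # column (6) of Table 68
print(round(a, 2), round(b, 2))
print([round(t) for t in trend])
```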

The trend values derived from this equation appear in 
column (6) of Table 68. The original data and line of 
trend are plotted in Fig. 57. The fit for the period covered 
is good. The number of concerns in business in the United 
States during the sixteen years before the World War is 
well defined by the straight line we have secured. 

FITTING A POWER CURVE OF THE SECOND DEGREE 

The record of commercial failures in the United States 
over the last forty years provides an example of a series 
following a definitely non-linear trend. Data for the period 
1897-1933 are presented in Table 69, with accompanying 
computations. 

Table 69 

Commercial Failures in the United States, 1897-1933 * 
Computation of values required in fitting line of trend 

 (1)      (2)       (3)           (4)            (5) 
 Year      x         y            xy             x²y 
                 (No. of 
                 failures) 

 1897    - 18     13,351     -  240,318       4,325,724 
 1898    - 17     12,186     -  207,162       3,521,754 
 1899    - 16      9,337     -  149,392       2,390,272 
 1900    - 15     10,774     -  161,610       2,424,150 
 1901    - 14     11,002     -  154,028       2,156,392 
 1902    - 13     11,615     -  150,995       1,962,935 
 1903    - 12     12,069     -  144,828       1,737,936 
 1904    - 11     12,199     -  134,189       1,476,079 
 1905    - 10     11,520     -  115,200       1,152,000 
 1906    -  9     10,682     -   96,138         865,242 
 1907    -  8     11,725     -   93,800         750,400 
 1908    -  7     15,690     -  109,830         768,810 
 1909    -  6     12,924     -   77,544         465,264 
 1910    -  5     12,652     -   63,260         316,300 
 1911    -  4     13,441     -   53,764         215,056 
 1912    -  3     15,452     -   46,356         139,068 
 1913    -  2     16,037     -   32,074          64,148 
 1914    -  1     18,280     -   18,280          18,280 
 1915       0     22,156              0               0 
 1916       1     16,993         16,993          16,993 
 1917       2     13,855         27,710          55,420 
 1918       3      9,982         29,946          89,838 
 1919       4      6,451         25,804         103,216 
 1920       5      8,881         44,405         222,025 
 1921       6     19,652        117,912         707,472 
 1922       7     23,676        165,732       1,160,124 
 1923       8     18,718        149,744       1,197,952 
 1924       9     20,615        185,535       1,669,815 
 1925      10     21,214        212,140       2,121,400 
 1926      11     21,773        239,503       2,634,533 
 1927      12     23,146        277,752       3,333,024 
 1928      13     23,842        309,946       4,029,298 
 1929      14     22,909        320,726       4,490,164 
 1930      15     26,355        395,325       5,929,875 
 1931      16     28,285        452,560       7,240,960 
 1932      17     31,822        540,974       9,196,558 
 1933      18     19,626        353,268       6,358,824 

 Totals          610,887    + 1,817,207      75,307,301 

* Dun and Bradstreet. 

N = 37 
Σ(x) = 0 
Σ(x²) = 4,218 
Σ(x³) = 0 
Σ(x⁴) = 864,690 
Σ(y) = 610,887 
Σ(xy) = 1,817,207 
Σ(x²y) = 75,307,301 


The origin is taken at the middle year to facilitate the 
calculations. The values of Σ(x²) and Σ(x⁴) may be secured 
from prepared tables, or from the formulas cited on page 254. 

The normal equations to be solved in fitting a second 
degree parabola, with the origin at the middle year of the 
period covered, are of the form 

Σ(y) = Na + cΣ(x²) 
Σ(xy) = bΣ(x²) 
Σ(x²y) = aΣ(x²) + cΣ(x⁴). 

Inserting the appropriate values, we have 

   610,887 = 37a + 4,218c 
 1,817,207 = 4,218b 
75,307,301 = 4,218a + 864,690c. 

Solving for the constants, 

a = 14,827.6 
b = 430.82 
c = 14.762. 

The required equation is 

y = 14,827.6 + 430.82x + 14.762x² 

with origin at 1915. 

The original observations and the line of secular trend 
are plotted in Figure 58. Observations, trend values and 
deviations from trend are given in Table 70. 

Fig. 58. Commercial Failures in the United States, 1897-1933, with 
Line of Trend 
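
The simplified normal equations may be solved directly from the observations. The sketch below is illustrative only: the figures are transcribed from Table 69 and the variable names are hypothetical. With the origin at 1915 the second equation gives b at once, and a and c follow from the remaining pair.

```python
# Number of commercial failures, 1897-1933, transcribed from Table 69.
failures = [13351, 12186, 9337, 10774, 11002, 11615, 12069, 12199, 11520,
            10682, 11725, 15690, 12924, 12652, 13441, 15452, 16037, 18280,
            22156, 16993, 13855, 9982, 6451, 8881, 19652, 23676, 18718,
            20615, 21214, 21773, 23146, 23842, 22909, 26355, 28285, 31822,
            19626]
years = list(range(1897, 1934))
xs = [year - 1915 for year in years]        # origin at the middle year

n = len(failures)
sum_y = sum(failures)
sum_xy = sum(x * y for x, y in zip(xs, failures))
sum_x2y = sum(x * x * y for x, y in zip(xs, failures))
sum_x2 = sum(x ** 2 for x in xs)
sum_x4 = sum(x ** 4 for x in xs)

# sum_xy = b * sum_x2 gives b directly.
b = sum_xy / sum_x2
# The other two equations form a pair in a and c.
det = n * sum_x4 - sum_x2 ** 2
a = (sum_y * sum_x4 - sum_x2 * sum_x2y) / det
c = (n * sum_x2y - sum_x2 * sum_y) / det

trend = [a + b * x + c * x * x for x in xs]
print(round(a, 1), round(b, 2), round(c, 3))
```

Trend values computed from the unrounded constants may differ in the last figures from those obtained with the rounded constants of the equation above.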

Commercial failures reflect the major cycles in American 
business, but with movements that reverse those of most 
economic series. Failures are numerous in times of depres- 
sion, fewer in prosperity. The reader who will compare the 
deviations from trend shown in Table 70 with the dates 
of reference cycles given on an earlier page will note the 
general agreement. The sharp fall in business failures 
from 1932 to 1933 reflected, of course, the special conditions 
prevailing in the latter year. 

Table 70 

Commercial Failures in the United States, 1897-1933 
Actual Values, Trend Values, and Deviations from Trend 

 Year     Number of      Trend value,      Deviation of 
          commercial     second degree     actual from trend 
          failures       parabola          value 

 1897      13,351        11,855.73         +  1,495.27 
 1898      12,186        11,769.88         +    416.12 
 1899       9,337        11,713.55         -  2,376.55 
 1900      10,774        11,686.75         -    912.75 
 1901      11,002        11,689.47         -    687.47 
 1902      11,615        11,721.72         -    106.72 
 1903      12,069        11,783.49         +    285.51 
 1904      12,199        11,874.78         +    324.22 
 1905      11,520        11,995.60         -    475.60 
 1906      10,682        12,145.94         -  1,463.94 
 1907      11,725        12,325.81         -    600.81 
 1908      15,690        12,535.20         +  3,154.80 
 1909      12,924        12,774.11         +    149.89 
 1910      12,652        13,042.55         -    390.55 
 1911      13,441        13,340.51         +    100.49 
 1912      15,452        13,668.00         +  1,784.00 
 1913      16,037        14,025.01         +  2,011.99 
 1914      18,280        14,411.54         +  3,868.46 
 1915      22,156        14,827.60         +  7,328.40 
 1916      16,993        15,273.18         +  1,719.82 
 1917      13,855        15,748.29         -  1,893.29 
 1918       9,982        16,252.92         -  6,270.92 
 1919       6,451        16,787.07         - 10,336.07 
 1920       8,881        17,350.75         -  8,469.75 
 1921      19,652        17,943.95         +  1,708.05 
 1922      23,676        18,566.68         +  5,109.32 
 1923      18,718        19,218.93         -    500.93 
 1924      20,615        19,900.70         +    714.30 
 1925      21,214        20,612.00         +    602.00 
 1926      21,773        21,352.82         +    420.18 
 1927      23,146        22,123.17         +  1,022.83 
 1928      23,842        22,923.04         +    918.96 
 1929      22,909        23,752.43         -    843.43 
 1930      26,355        24,611.35         +  1,743.65 
 1931      28,285        25,499.79         +  2,785.21 
 1932      31,822        26,417.76         +  5,404.24 
 1933      19,626        27,365.25         -  7,739.25 

The second degree curve employed to define the trend 
of commercial failures does so with reasonable accuracy 
over the period here covered. Extrapolation beyond those 
limits would be hazardous. Indeed, the changed conditions 
under which banking and many other types of business 
were conducted after 1933 may well break the continuity 
of the series, and generate a new long-term trend. 

The Use of Logarithms in Curve Fitting

The family of curves described above represents a simple
and very useful type. Perhaps of even greater general
utility, in the analysis of time series, are curves of a
semi-logarithmic type. The advantages of plotting many series
of data on semi-logarithmic or “ratio” paper were explained
in an earlier section. A fundamental virtue of this type
of plotting is that it presents a true picture of relative
variations, of ratios between magnitudes. Relations of
this type are ordinarily of primary interest in the analysis
of economic data, and it is logical that determination of
trends should proceed on the same basis.

In doing so, we can make use of a group of curves of the
same general form as those already described, the one
difference being that log y takes the place of y throughout.
That is, the straight line form is log y = a + bx, while the
general form for the potential series is log y = a + bx + cx^2
+ dx^3 + . . . . The curves secured may be constructed on
arithmetic paper, plotting the natural x's and the logarithms
of the y's, or natural values of both x's and y's may be
plotted on semi-logarithmic paper, the logarithmic scale
being used for the y-axis. The latter is the simpler
method.

To illustrate the procedure, the steps involved in fitting
a curve of the type log y = a + bx will be shown. The
trend of petroleum production in the United States from
1922 to 1929 is to be determined. The values needed in
the normal equations are derived from Table 71.

Table 71

Petroleum Production in the United States, 1922-1929
(Computation of values required in fitting line of trend)

                      y
Year       x      (millions       log y       x · log y
                  of bbls.)

1922       1        557.5        2.74624       2.74624
1923       2        732.4        2.86475       5.72950
1924       3        713.9        2.85364       8.56092
1925       4        763.7        2.88292      11.53168
1926       5        770.9        2.88700      14.43500
1927       6        901.1        2.95477      17.72862
1928       7        901.5        2.95497      20.68479
1929       8      1,007.3        3.00316      24.02528

                                23.14745     105.44203

N = 8             Σ(log y) = 23.14745
Σ(x) = 36         Σ(x · log y) = 105.44203
Σ(x^2) = 204

The two normal equations to be solved are of the form

Σ(log y) = Na + bΣ(x)
Σ(x · log y) = aΣ(x) + bΣ(x^2).

Substituting the given values we have

23.14745 = 8a + 36b
105.44203 = 36a + 204b.

Solving for the constants,

a = 2.75645
b = .03044.

The equation to the desired curve is, therefore,

log y = 2.75645 + .03044x
with origin at 1921.
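
The solution of the two normal equations may be checked mechanically. The following is a minimal sketch in Python, not part of the original computation; the variable names are illustrative, and the logarithms are computed to full machine precision rather than taken from a five-place table, so the constants should agree with those above only to about four decimal places.

```python
import math

# Petroleum production, 1922-1929, in millions of bbls. (Table 71).
production = [557.5, 732.4, 713.9, 763.7, 770.9, 901.1, 901.5, 1007.3]
x = list(range(1, len(production) + 1))   # x = 1 for 1922, so origin at 1921
log_y = [math.log10(v) for v in production]

# Sums entering the normal equations.
n = len(x)
sum_x = sum(x)
sum_x2 = sum(v * v for v in x)
sum_ly = sum(log_y)
sum_xly = sum(xi * li for xi, li in zip(x, log_y))

# Solve  sum_ly  = n*a     + b*sum_x
#        sum_xly = a*sum_x + b*sum_x2   by determinants.
det = n * sum_x2 - sum_x ** 2
b = (n * sum_xly - sum_x * sum_ly) / det
a = (sum_ly * sum_x2 - sum_x * sum_xly) / det

print(f"a = {a:.5f}, b = {b:.5f}")
```

The two constants printed should be close to 2.75645 and .03044.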

In fitting this curve by the method of least squares, as
is done above, we satisfy the condition that the sum of
the squares of the logarithmic deviations shall be a minimum.
That is, the deviations to which this condition relates are
the differences between the logarithms of the observed
values and the logarithms of the corresponding trend values.
This curve, it should be noted, is not the same as that
from which the sum of the squares of the arithmetic (natural)
deviations is a minimum.

The substitution in the above equation of the value of 
X representing any given year will enable the logarithm 
of the trend or normal value to be calculated. The trend 
value in natural numbers may then be determined. In 
Table 72 the normal value for each of the years covered 
is given, together with the percentage relation of actual to 
normal. 

Table 72

Trend of Petroleum Production in the United States, 1922-1929,
with Comparison of Actual and Trend Values
(Straight line trend determined from logarithms of production figures)

                 y (actual)       log y_c       y_c (computed)    Percentage
Year      x      Production       Log of        Trend value       relation of
                 (in millions     trend         (in millions      actual to
                 of bbls.)                      of bbls.)         trend

1922      1        557.5         2.78689          612.2             91.1
1923      2        732.4         2.81733          656.6            111.5
1924      3        713.9         2.84777          704.3            101.4
1925      4        763.7         2.87821          755.5            101.1
1926      5        770.9         2.90865          810.3             95.1
1927      6        901.1         2.93909          869.1            103.7
1928      7        901.5         2.96953          932.2             96.7
1929      8      1,007.3         2.99997          999.9            100.7


The points representing the actual production, together 
with the line of trend, are plotted in Fig. 59. The graph 
of the derived equation gives a good representation of 
the trend in the present instance. 
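
The trend values and percentages of Table 72 can be reproduced from the fitted equation. The sketch below is an illustration only: it takes the constants a = 2.75645 and b = .03044 as given, and the function name `trend_value` is an assumption introduced here, not a term from the text.

```python
# Constants of the fitted equation log y = a + b*x, origin at 1921.
a, b = 2.75645, 0.03044

# Actual production in millions of bbls. (Table 72).
actual = {
    1922: 557.5, 1923: 732.4, 1924: 713.9, 1925: 763.7,
    1926: 770.9, 1927: 901.1, 1928: 901.5, 1929: 1007.3,
}

def trend_value(year, origin=1921):
    """Trend value in natural numbers: the antilogarithm of a + b*x."""
    x = year - origin
    return 10 ** (a + b * x)

rows = []
for year, y in actual.items():
    t = trend_value(year)
    rows.append((year, t, 100.0 * y / t))   # year, trend, per cent of trend

for year, t, pct in rows:
    print(f"{year}  {t:8.1f}  {pct:6.1f}")
```

Each printed line gives the year, the trend value, and the percentage relation of actual to trend, for comparison with the last two columns of the table.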

An equation of this type, defining a linear trend in the
logarithms of the dependent variable, has certain dis-
tinctive advantages. The reader will note that this is the
logarithmic form of an equation to a compound interest
curve (an exponential curve). This equation was given
in Chapter II as

y = p(1 + r)^x
or

log y = log p + x log (1 + r).

In the example just given we have used the symbol a for
log p and the symbol b for log (1 + r), but the equations
are identical.



Fig. 59. Production of Petroleum in the United States, 1922-1929,
with Line Defining Average Rate of Growth


We may readily change to natural numbers the constants
in the equation defining the trend of petroleum production
from 1922 to 1929. We have

log y = 2.75645 + .03044x

where 2.75645 is log p and .03044 is log (1 + r). The
natural number corresponding to 2.75645 is 570.8; the
natural number corresponding to .03044 is 1.0726. The
trend of petroleum production in natural form is, therefore,

y = 570.8(1.0726)^x

with origin at 1921.

Subtracting 1 from the constant 1.0726 we secure .0726,
which is r, the rate of increase of a series growing in
accordance with the compound interest law. (If, on subtracting
1, we have a negative value, the growth is negative, of
course.) This measure indicates that the production of
crude petroleum increased at an average rate of 7.26 per
cent a year between 1922 and 1929 (r being multiplied by
100 to place it on a percentage basis).
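
The passage from the logarithmic constants to p, 1 + r and r is a matter of taking antilogarithms. A short sketch follows, using the rounded constants given in the text; the variable names are illustrative.

```python
# Constants of log y = a + b*x for petroleum production, origin at 1921.
a, b = 2.75645, 0.03044

p = 10 ** a                # trend value at the origin (log p = a)
growth_factor = 10 ** b    # 1 + r  (log (1 + r) = b)
r = growth_factor - 1      # average annual rate of growth

print(f"p = {p:.1f}, 1 + r = {growth_factor:.4f}, r = {100 * r:.2f} per cent")
```

A negative b would give a growth factor below 1 and hence a negative r, the case of decline mentioned in the text.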

When the trend of a series in time may be defined by a
straight line on ratio paper, and it is surprising how widely
applicable such a function is, the constant r is a highly
useful measure. It defines the average annual rate of
growth or decline of the series. It is, of course, an abstract
measure and thus has the great merit of permitting com-
parison of the trends of series relating to widely different
original units. The rate of growth of population, over a
given period, may have been 1.4 per cent per year; the
production of gasoline may have increased at a rate of
4.5 per cent, the production of automobiles at 4.2 per
cent, the production of wheat at 1.1 per cent, total national
income at 1.6 per cent, total national debt at 3.2 per cent.
The trends of these series are immediately comparable, and
conclusions concerning the direction and character of a na-
tion’s development may be drawn. This measure provides a
valuable device for the study of social and economic change.*

Fig. 60. Production of Crude Petroleum in the United States,
1918-1936, with Line of Trend
* In any extensive application of this procedure time and labor may be
[...] James W. Glover, Tables of [...] 1923, 468ff.). By the use of this
table the compound [...] numbers. All necessary [...]

Table 73

Production of Crude Petroleum in the United States, 1918-1936
(Trend values determined from second degree parabola fitted to
logarithms of production figures)

          y (actual)       y (computed)      Percentage
Year      Production       Trend value       relation of
          (in millions     (in millions      actual to
          of bbls.)        of bbls.)         trend

1918         335.9            345.0             97.4
1919         378.4            395.5             95.7
1920         442.9            449.2             98.6
1921         472.2            505.2             93.5
1922         557.5            562.8             99.1
1923         732.4            620.8            118.0
1924         713.9            678.2            105.3
1925         763.7            733.8            104.1
1926         770.9            786.3             98.0
1927         901.1            834.3            108.0
1928         901.5            876.8            102.8
1929       1,007.3            912.4            110.4
1930         898.0            940.5             95.5
1931         850.3            960.0             88.6
1932         785.2            970.4             80.9
1933         905.7            971.5             93.2
1934         908.1            963.2             94.3
1935         996.6            945.7            105.4
1936       1,098.5            919.6            119.5

When we deal with a series marked by [...] over a longer
period, as is done in Fig. 60, the straight line secured for
the period 1922-1929 is not appropriate. The addition of a
third constant gives an equation of the type

log y = a + bx + cx^2.

In fitting this to the data of petroleum production for
the period 1918-1936, we may follow the exact procedure
used when y was the dependent variable in a similar equa-
tion (see page 261), except that log y is used throughout,
instead of y. For the required equation we have

log y = 2.921331 + .023660x - .002107x^2

with origin at 1927. This is shown graphically in Fig. 60.
Actual and trend values, in natural terms, are given in
Table 73 on page 269.
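
The trend values of Table 73 follow directly from this equation. The sketch below evaluates it for any year and also locates the turning point of the parabola; the function names are illustrative assumptions, and the constants are those printed above.

```python
# Constants of log y = A + B*x + C*x^2, with origin at 1927.
A, B, C = 2.921331, 0.023660, -0.002107

def log_trend(year, origin=1927):
    """Logarithm of the trend value for the given year."""
    x = year - origin
    return A + B * x + C * x * x

def trend(year):
    """Trend value in natural numbers (millions of bbls.)."""
    return 10 ** log_trend(year)

# The parabola in the logarithms reaches its maximum where B + 2*C*x = 0.
turning_year = 1927 + (-B / (2 * C))

for year in (1918, 1927, 1933, 1936):
    print(year, round(trend(year), 1))
print("turning point near", round(turning_year, 1))
```

Because c is negative the fitted curve rises, flattens and then declines, which is why the computed trend values in Table 73 reach their highest point in 1933 and fall thereafter.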

Other Curve Types

The two families of curves described in the preceding
sections meet most of the needs of the economic statistician.
The trend in most time series may be described by curves
of the power series, fitted either to natural numbers or
to the logarithms of the data (that is, to the logarithms
of the y values; time, the x-variable, is treated in terms of
natural numbers in fitting both the above types of curves).
These classes constitute flexible and widely applicable curve
forms.* Attention may be called to several other curve
types which have been applied less extensively to time
series, but with favorable results in particular cases.

Curves of the ordinary parabolic type (y = ax^b) are not
generally applicable to economic data in the form of time
series, as their use involves the treatment of the time
variable as a geometric series. Such a curve, it will be
recalled, becomes a straight line on double logarithmic
paper. Yet if a curve of this form serves accurately to
describe the trend of a given series, its use is justified,
empirically.

* There are available for fitting higher degree curves of the power series
methods that lessen the labor involved, particularly if curves of different degree
are to be fitted to the same data. These methods, which reduce the fitting
process to a series of simple adding machine operations, are appropriate to
extended research projects. Their use is not advisable, however, unless work
involving a considerable number of routine operations is contemplated. It is
desirable that the student master the basic least squares procedures outlined
in the preceding pages, utilizing other methods only in case extended computing
tasks are undertaken.

For accounts of systematic methods suited to extensive calculations, see
R. A. Fisher, Statistical Methods for Research Workers, Edinburgh, Oliver
and Boyd, Sixth edition, 1936, 148-156; Max Sasuly, Trend Analysis of
Statistics: Theory and Technique, Washington, Brookings Institute, 1934. The
application of the method of orthogonal polynomials described by Fisher is
admirably exemplified in James W. Angell, The Behavior of Money, New York,
McGraw-Hill, 1936, 195-202.

Such curves may be fitted most readily by employing
logarithms and using an equation of the linear type. The
equation

y = ax^b

becomes, in logarithmic form,

log y = log a + b log x.

The two normal equations needed in fitting such a curve
are of the form

Σ(log y) = N log a + bΣ(log x)

Σ(log x · log y) = log a Σ(log x) + bΣ[(log x)^2].

By substituting the values computed from the data, these
equations may be solved for log a and b, just as in fitting
an ordinary straight line.*
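
A minimal sketch of this fitting process is given below. The data are hypothetical, constructed to lie exactly on the curve y = 3x^1.5 so that the recovered constants can be checked; they are not drawn from any series in the text, and the function name `fit_power_curve` is likewise an assumption.

```python
import math

def fit_power_curve(xs, ys):
    """Fit y = a * x**b by least squares on log y and log x.

    Returns (a, b). All x and y values must be positive.
    """
    log_x = [math.log10(v) for v in xs]
    log_y = [math.log10(v) for v in ys]
    n = len(xs)
    s_lx = sum(log_x)
    s_ly = sum(log_y)
    s_lx2 = sum(v * v for v in log_x)
    s_lxly = sum(u * v for u, v in zip(log_x, log_y))
    # Solve the two normal equations for log a and b.
    det = n * s_lx2 - s_lx ** 2
    b = (n * s_lxly - s_lx * s_ly) / det
    log_a = (s_ly - b * s_lx) / n
    return 10 ** log_a, b

# Hypothetical observations lying exactly on y = 3 * x**1.5.
xs = [1, 2, 3, 4, 5, 6]
ys = [3.0 * x ** 1.5 for x in xs]

a, b = fit_power_curve(xs, ys)
print(round(a, 4), round(b, 4))
```

With actual observations the points will scatter about the fitted line on double logarithmic paper, and the minimized quantity is the sum of squared deviations of log y, not of y itself.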

The equation to the simple exponential curve may be
written

y = ar^x.

(The r in this equation is the equivalent of 1 + r, as given
on p. 267.) This equation may be used to define the trend
of a series increasing or decreasing in geometric progression.
It has been observed that the trends of economic series
frequently depart from such a geometric progression by
constant magnitudes. By adding this magnitude, in a
given case, to the original series (or subtracting it), a
modified series with a clear exponential trend may be
secured. The trend of the original series may be written

y = K + ar^x

where K is the constant magnitude by which the series
departs from a geometric progression. A modified exponential
curve of this type may give a highly satisfactory representa-
tion of trend, in certain cases. The method employed in
fitting such a curve is discussed in Appendix D.

* A useful table of the sums of the logarithms of the natural numbers from
1 to 100 is included as an appendix to Medical Biometry and Statistics, by
Raymond Pearl, Philadelphia, Saunders, 1923.

Some use has been made, in the interpretation of eco- 
nomic statistics, of the Gompertz curve, the equation to 
which was originally developed in the actuarial field. The 
equation is 

y = ab^(c^x).

Its use in the analysis of economic statistics has been based 
upon the argument that there is a general law of growth 
characteristic of population increase, and that this same 
type of growth is found in industries whose products are a 
direct function of the growth of population. 

A somewhat similar curve of growth, the “logistic,” has
been employed by Verhulst and more recently by Raymond 
Pearl and Lowell J. Reed in forecasting population growth. 
This curve has been found to describe the trends of cer- 
tain economic series. Examples of the procedures employed 
in fitting Gompertz and logistic curves are given in Appen- 
dix D. 

The Determination of Monthly Trend Values

The procedures so far described have dealt with annual 
measurements only. Having fitted a line or curve to annual 
data it is frequently necessary to make a transition to 
monthly units. Problems involving such monthly measure-
ments are faced in the study of cyclical movements which
are discussed in the next chapter.

The constant a in the trend equation defines the trend
value in the year taken as origin. If the annual data
employed in the fitting processes are averages of twelve
monthly values (e.g., the average price of pig iron in 1937)
the constant a measures the trend value for a month cen-
tering at the middle of the year covered by the annual
figures. If the annual data are aggregates of twelve monthly
values (e.g., total production of pig iron in 1937) the constant
a must be divided by 12 to obtain the trend value for the
month centering at the middle of the year.

If the trend be linear, the constant b in the equation
y = a + bx defines the change due to trend over a twelve-
month period. In interpolating for monthly trend values,
the increment (or decrement) from month to month (e.g.,
from January to February of the year 1937) is b/12 if the
annual data employed in the fitting process are averages
of monthly values. The increment from month to month is
b/144 if the annual data are aggregates of monthly values.

The one further step needed is properly to center the
monthly trend values. These should, of course, be centered
at points of time corresponding to those to which the
actual monthly data relate. In averaging, or aggregating,
monthly data relating to the middle of each of the twelve
months in a calendar year we secure a figure centering at
July 1. The month centering at the middle of the year
of origin thus centers at July 1. For comparison with actual
monthly data, we desire trend values centering at July 15,
August 15, etc. At the beginning, therefore, we must add
to the trend value for the month centering at the middle
of the year of origin (that is, to a or to a/12) one-half of the
month-to-month increment (or decrement) that we have ob-
tained from b of the trend equation. This procedure gives us
the trend value for the month centering at July 15. This value
may be compared with the actual value recorded for that
month. The addition to this of the month-to-month trend
increment (or decrement) gives trend values for all following
months; subtraction gives trend values for all preceding
months.*
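
The arithmetic of this conversion may be sketched as follows, for a linear trend y = a + bx only. The function name and the numerical constants are hypothetical, chosen so that the results can be verified by inspection. The monthly increment is taken as b/12 when the annual figures are monthly averages, and as b/144 when they are annual aggregates (a/12 being then the monthly level and b/12 its yearly change).

```python
def monthly_trend(a, b, months_from_july, aggregates=False):
    """Monthly trend value from an annual linear trend y = a + b*x.

    months_from_july counts months after July 15 of the year of origin
    (0 = July, 1 = August, -1 = June, and so on).
    aggregates=True means the annual data were twelve-month totals.
    """
    if aggregates:
        level = a / 12.0     # monthly level centered at July 1
        step = b / 144.0     # month-to-month increment
    else:
        level = a            # annual figure is already a monthly average
        step = b / 12.0
    july_15 = level + step / 2.0   # shift the centering from July 1 to July 15
    return july_15 + months_from_july * step

# Hypothetical constants: annual averages with a = 100, b = 12.
print(monthly_trend(100, 12, 0))    # July of the origin year
print(monthly_trend(100, 12, 1))    # August
print(monthly_trend(100, 12, -1))   # June

# The same series expressed as annual totals: a = 1200, b = 144.
print(monthly_trend(1200, 144, 0, aggregates=True))
```

Both forms of the annual data lead to the same monthly trend values, as they should, since they describe the same series.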


* [...] If the trend equation is non-linear, the process of [...]
correspondingly modified. For a discussion of appropriate [...] the reader
is referred to any treatise dealing with the general principles of
interpolation. The Calculus of Observations, by Whittaker and Robinson,
contains an excellent treatment of this topic.

The Selection of a Curve to Represent Trend

Various types of curves which may be fitted to represent
the trend of economic data over a period of time have been
described. But which of these many types is to be selected
in a given case? Which will give the best standard of
normality for each of the years covered? Several references
to this problem have been made in the preceding sections,
but no general principles have been laid down. And in
fact, no general principles can be evoked to answer this
fundamental question. There is no absolute test of goodness
of fit in such cases. It is largely a matter of personal judg-
ment as to the type of curve which best represents the
trend in a given instance, and experience must play a
dominant part in such judgments. But certain general
considerations are of assistance in selecting the appropriate
type of curve.

1. The first step in the selection of a curve type is the
plotting of the data. When this has been done, it is fre-
quently possible by inspection to determine the appropriate
form. The data may be plotted in four different combina-
tions, of which the first two are of chief importance in
dealing with economic material.

a. Natural x, natural y. (Plot the given figures on ordinary
arithmetic paper.)

b. Natural x, log y. (Plot the x's on the natural scale, and plot
the y's on the logarithmic scale; i.e., use semi-logarithmic
paper.)

c. Natural y, log x. (Plot on semi-logarithmic paper, with the
x-scale logarithmic.)

d. Log y, log x. (Plot on paper with logarithmic ruling on both
scales.)

If in any of these cases a straight line trend is secured,
a type of equation which plots as a straight line under the
given conditions (cf. Chapter II) would be selected. If a
linear equation is not appropriate some other simple type
may be suggested by the plotted data. In studying such
graphs for the purpose of selecting a curve to represent
trend, one should be familiar with the curves representing
all the simpler equations.
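
Inspection of the four plots is a matter of judgment, but a rough numerical stand-in can be sketched. The example below is an assumption of this illustration and not a procedure given in the text: it measures how nearly straight each of the four combinations is by the absolute value of the coefficient of linear correlation. The series is hypothetical, growing by 50 per cent in each period.

```python
import math

def pearson_r(u, v):
    """Coefficient of linear correlation between two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((p - mean_u) * (q - mean_v) for p, q in zip(u, v))
    s_uu = sum((p - mean_u) ** 2 for p in u)
    s_vv = sum((q - mean_v) ** 2 for q in v)
    return s_uv / math.sqrt(s_uu * s_vv)

def straightness(xs, ys):
    """|r| for each of the four plotting combinations (positive data only)."""
    log_x = [math.log10(v) for v in xs]
    log_y = [math.log10(v) for v in ys]
    return {
        "a. natural x, natural y": abs(pearson_r(xs, ys)),
        "b. natural x, log y": abs(pearson_r(xs, log_y)),
        "c. log x, natural y": abs(pearson_r(log_x, ys)),
        "d. log x, log y": abs(pearson_r(log_x, log_y)),
    }

# Hypothetical series in geometric progression (not from the text).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.0 * 1.5 ** x for x in xs]

scores = straightness(xs, ys)
best = max(scores, key=scores.get)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
print("straightest:", best)
```

A series of this kind plots as a straight line on semi-logarithmic paper, so combination b stands out. With actual data the four measures will usually lie closer together, and the plotted figures should still be examined.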

2. The appropriate curve may be determined by a study
of the relations between the two variables, x and y. In
the simpler cases the following relations hold:*

a. If, when the values of x are arranged in an arithmetic series,
the corresponding values of y form a geometric series, the
relation is of the exponential type, described by the equa-
tion

y = ab^x.

b. If, when the values of x are arranged in a geometric series, the
corresponding values of y form a geometric series, the rela-
tion is of the simple parabolic or hyperbolic type, described
by the equation

y = ax^b.

c. If, when the values of x are arranged in an arithmetic series,
the first differences of the corresponding y's are constant, the
relation is of the straight line type, described by the equa-
tion

y = a + bx.

The differences between successive y values, when x's are arranged in an
arithmetic series, are termed “first differences” or “first order differences”
and are represented by the symbol Δy. The differences between successive
first differences are called “second differences” and are represented by the
symbol Δ^2 y. Differences of higher order are similarly derived. The following
table illustrates the formation of differences:

  x         y        Δy      Δ^2 y    Δ^3 y

  1        11
  2        40        29
  3       101        61       32
  4       206       105       44       12
  5       367       161       56       12
  6       596       229       68       12
  7       905       309       80       12
  8     1,306       401       92       12
  9     1,811       505      104       12
 10     2,432       621      116       12

d. If, when the values of x are arranged in an arithmetic series,
the nth differences of the corresponding y's are constant, the
relation between the variables is described by an equation of
the potential series carried to the nth power of x; that is, by an
equation of the type

y = a + bx + cx^2 + dx^3 + . . . + qx^n.

Thus, in the example given above, in which the third differ-
ences are constant, the relation between x and y would be
described by an equation of the form

y = a + bx + cx^2 + dx^3.

* It will be recalled that an arithmetic series changes by a constant absolute
increment, while a geometric series changes by a constant percentage.
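
The formation of differences lends itself to a short routine. The sketch below, with illustrative function names, reproduces the difference columns for the ten y values of the example and reports the order at which the differences become constant.

```python
def differences(values):
    """Successive differences of y values whose x's are equally spaced."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

def constant_difference_order(values):
    """Order of the first difference column that is constant, or None."""
    column = list(values)
    order = 0
    while len(column) >= 2:
        if len(set(column)) == 1:
            return order
        column = differences(column)
        order += 1
    return None

# The y values of the example, for x = 1, 2, ..., 10.
y = [11, 40, 101, 206, 367, 596, 905, 1306, 1811, 2432]

d1 = differences(y)    # first differences
d2 = differences(d1)   # second differences
d3 = differences(d2)   # third differences

print(d1)
print(d2)
print(d3)
print("constant at order", constant_difference_order(y))
```

The test for exact equality is suited only to an artificial example of this kind. With economic data the differences of some order will at best be approximately constant, and a tolerance would have to be chosen by judgment.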

When one is selecting a curve to use in the analysis of 
economic data, he will rarely, if ever, find these tests to 
be met perfectly. This would happen only when the curve 
chosen passed through all the plotted points. But data 
in a given case will generally approximate some one of 
the conditions described above, and the appropriate type 
of curve will be indicated. 

3. If the study of the original data does not render a
definite decision possible, several types of curves may be
fitted to the data and the decision made by comparing
the results. If the equations to the curves being compared
contain the same number of constants, a comparison of
the root-mean-square deviations about the curves furnishes
a conclusive and valid test of the closeness of the fit within
the limits of the data.

The root-mean-square deviation may be readily computed
by making use of the following relationship

Σ(d^2) = Σ(y^2) - aΣ(y) - bΣ(xy) - cΣ(x^2 y) - . . .

where Σ(d^2) is the sum of the squares of the deviations
about the line of trend. (The derivation of this equation is 
explained in Appendix A, in which a generalized form is 
given.) If the equations do not contain an equal number 
of constants, a test of this sort is invalid and the comparison 
can only be made by inspection. Personal judgment as 
to the curve which represents the trend most accurately 
must be the basis of the decision in such cases. 
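
The shortened computation of the sum of squared deviations may be verified against the direct computation. The sketch below does so for the straight line fitted to the logarithms of petroleum production, 1922-1929, so that y in the relationship stands here for log y; the variable names are illustrative. The relationship holds exactly only when a and b are the least squares constants, and the constants are therefore computed afresh rather than taken in rounded form.

```python
import math

# Petroleum production, 1922-1929, in millions of bbls.
production = [557.5, 732.4, 713.9, 763.7, 770.9, 901.1, 901.5, 1007.3]
x = list(range(1, 9))                       # origin at 1921
Y = [math.log10(v) for v in production]     # the fitted variable is log y

n = len(x)
sum_x, sum_x2 = sum(x), sum(v * v for v in x)
sum_Y = sum(Y)
sum_xY = sum(xi * yi for xi, yi in zip(x, Y))
sum_Y2 = sum(yi * yi for yi in Y)

# Least squares constants from the normal equations.
det = n * sum_x2 - sum_x ** 2
b = (n * sum_xY - sum_x * sum_Y) / det
a = (sum_Y - b * sum_x) / n

# Sum of squared deviations: shortened form and direct form.
sum_d2_short = sum_Y2 - a * sum_Y - b * sum_xY
sum_d2_direct = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, Y))

rms = math.sqrt(sum_d2_direct / n)          # root-mean-square deviation
print(sum_d2_short, sum_d2_direct, rms)
```

The shortened form subtracts quantities that are large relative to the result, so sufficient decimal places must be carried when it is used with hand computation.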

It should be remembered that the closeness of fit within 
the limits of the data is not of itself a final criterion. An 
equation could be secured, having a number of constants 
equal to the number of points, which would give a curve 
passing through every point plotted, yet such a curve 
would not necessarily represent the trend. The concept 
of a trend is of a regular, smooth underlying movement, 
from which there are deviations, but which marks the long- 
time tendency of the series. In general, therefore, the curve 
should be of simple form, if it is to be consistent with the 
concept of secular trend. This does not mean, however, 
that a complex trend can be represented by a simple curve 
which fails to conform to the plotted data. 

4. An important question to be answered before the 
form of curve can be selected relates to the limits within 
which the line of trend is to be used. If it is to be used only 
within the limits of the plotted data (i.e., for interpolation) 
one set of considerations governs the choice of a curve. If 
it is to be projected beyond the limits of the data, used as 
a basis for the determination of normal during a subsequent 
period, other considerations enter. In the former case a 
reasonable fit to the data is the sole requirement; in the
latter case it is necessary, in addition, that the trend of the
projection be logical, and consistent with the past record.

The fact should be clearly recognized that projection, or
extrapolation, represents a guess, justified only on the 
assumption that a proper line of trend has been fitted and 
that the same conditions that affected the series in the 
past will prevail in the future. A change in conditions, the 
introduction of new elements, renders the projection invalid. 
When dealing with economic statistics, moreover, it is 
ordinarily impossible to tell, except in retrospect, when a 
change has taken place. Conclusions drawn from the projection
of a line of trend are always subject to error, therefore.
In practical statistical work such projections are made, 
and are justified on the ground that the most probable 
course in the future is that which prevailed in the past. 
Projections into the distant future are, of course, subject 
to wider margins of error than short-time projections. 
Lines of trend should be revised from time to time, there- 
fore, as new data become available. 

When a projection is to be made, a simple curve with few 
constants is to be preferred to a more complicated one. A 
third or fourth degree parabola may give an excellent fit 
to the data in a given case, but the projection of such curves 
is inadvisable. It is well to remember, as Perrin has pointed 
out, that a curve suitable for interpolation may not be at 
all adapted to extrapolation. 

The avoidance of distortion of trend lines by abnormal 
conditions in the terminal years of the period studied is 
particularly important when a trend is to be projected. 
Reference is made to this point in the next chapter. 

It seems to be true, in general, that simple curves fitted
to the logarithms of the y’s give more reliable results when
projected than curves fitted to the natural numbers. In
an interesting discussion of this point, Karl G. Karsten*
argues that phenomena characterized by a uniform rate
of change are more likely to maintain their trend than
phenomena marked by a uniform amount of change. It is
the semi-logarithmic curves, of course, which best measure
rates of change.

* Karl Karsten, Charts and Graphs, New York, Prentice Hall, 1923, 423-425.

5. It is frequently true that no one curve will fit a given 
series during the entire period it is desired to study. This 
may be due to changes in conditions which cause the trend 
to be altered. Thus the trend of wholesale prices was 
downward, in a direction well represented by a straight 
line, from the close of the Civil War to 1896. From 1896 
to the beginning of the World War the trend was upward, 
and could be described by a second degree parabola. From 
1921 to 1929 the trend was also curvilinear, rising to 1925, 
declining thereafter. Similar changes occur in many eco- 
nomic series. By breaking the entire period up into sections, 
appropriate lines of trend may be fitted to the several 
periods thus marked off. This process may be carried to 
a quite illogical extreme, however. The concept of trend 
is of a gradual, long-term change, and the breaking up 
of a series in order to fit a number of trend lines is contrary 
to the whole conception. It may be justified upon occasion 
when a real change in conditions occurs, but in all cases 
the attempt should be made to represent the trend during 
the whole period by a single line. 

Deflation as a Step in Analysis

Many series of economic data are expressed in monetary
units, in dollars, pounds, or francs. Such series are subject
to distortion because of changes in the price level. Thus
the value of heavy engineering contracts awarded in the
United States in 1913 amounted to approximately 601
millions of dollars; in 1929 the value of engineering contracts
awarded in the same territory amounted to approximately
3,950 millions of dollars.* Was the volume of engineering
construction in 1929 over six times that in 1913? It was
not. The value of construction contracts awarded in any
year depends not only upon the actual volume of construction
but also upon the costs of construction materials and
labor, and these costs increased substantially from 1913
to 1929. If we wish to measure the change in the volume
of construction alone, these values must be corrected for
the increase in construction costs between 1913 and 1929.
Such a process is termed deflation.†

* Figures compiled by Engineering News Record.

The selection of an appropriate deflating index is the 
central problem in such cases. For the present purpose we 
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Fig. 61. — Comparison of Actual and Deflated Values of Contracts 
Awarded in Engineering Construction, 1913-1936 


may use an index of constructive costs, based upon the 
prices of steel, cement, and lumber, and upon wage rates 
in construction industries, compiled by the Engmeermg 
News Record. This index shows that construction costs 
in 1929 were approximately’ 107 per cent higher than, in 

^ The term deflation m not inappropriate when correction is being made for 
an advance in prices; it is less suitable •when correction is made for a fail in 
prices. The period selected asi a standard of reference may be one in which 
prices were relatively high; division by a price or cost index resting on suc!i a 
ymr as base will raise values relating to other periods. The word deflation 
Is a convenient one to use for this general process, however. In using it in 
this somewhat technical sense we must understand it to mean correction for 
changes in the value of the dollar (as measured by specific indices of prices or 
costs). 
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Actual 


Year 


1913 

1914 

1915 

1916 

1917 

1918 

1919 

1920 

1921 

1922 

1923 

1924 

1925 

1926 

1927 

1928 

1929 

1930 

1931 

1932 

1933 

1934 

1935 

1936 


Table 74 

arul Deflated Values of Contracts Awarded 
Construction, 1913-1936 


Contracts awarded, 
engineering con- 
struction {monthly 
average, in thou- 
sands of dollars) * 

50,117 
48,574 
48,740 
77, ns 
61,592 
82,729 
97,991 
126,923 
99,459 
129,716 
158,670 
166,593 
213,287 
237,820 
271,147 
298,215 
329,193 
264,438 
202,693 
101,609 
89,031 
113,383 
132,513 
198,904 


Index of 
construction 
costs 1 

1.000 

.886 

.926 

1.296 

1.812 

1.892 

1.984 

2.513 

2.018 

1.745 

2.141 

2.154 

2.067 
2.080 
2.062 

2. 068 
2.070 
2.029 
1.814 
1.570 
1.702 
1.981 
1.952 
2.065 


in Engineering 

Deflated value of 
contracts awarded 
(montMy average, 
in thousands of 
dollars) 

50,117 
54,824 
52,635 
60,014 
33,991 
43,726 
49,391 
50,507 
49,286 
74,336 
74,110 
77,341 
103,187 
114,337 
131,497 
144,205 
159,030 
130,329 
111,738 
64,719 
52,310 
57,235 
67,886 
96,322 


[...] index is 100 for 1913, 207 for 1929). Dividing
the 1929 aggregate by 2.07, to correct for the change [...]
taken to measure the aggregate value of
engineering contracts awarded in 1929, when the 1913 [...]
of money is assumed to be held constant with reference
to the year which is the base of the price or cost
index used as a deflator.) If the deflating index may be
accepted as an accurate measure of changing costs, the 
deflated series may be assumed to define changes in the 
actual volume of engineering construction. The effects of 
changing prices and wages will have been eliminated. 
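
The arithmetic of deflation may be sketched in a few lines of Python. This is only an illustration of the division described above; the function name `deflate` is our own, and the four rows are transcribed from Table 74.

```python
def deflate(values, index):
    """Divide each dollar value by the cost index for the same year.

    The index is expressed as a ratio with the base year equal to 1.000,
    so the result is stated in dollars of the base year.
    """
    if len(values) != len(index):
        raise ValueError("values and index must cover the same years")
    return [v / i for v, i in zip(values, index)]


# A few rows from Table 74 (monthly averages, thousands of dollars).
years = [1913, 1920, 1929, 1932]
contracts = [50_117, 126_923, 329_193, 101_609]
cost_index = [1.000, 2.513, 2.070, 1.570]

deflated = deflate(contracts, cost_index)
for year, value in zip(years, deflated):
    print(year, round(value))
```

The rounded results agree with the deflated column of Table 74 for these years. The quality of the deflated series rests entirely on how well the chosen index measures the price changes that affect the series in question.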

The general procedure is illustrated in greater detail in 
Table 74 on page 281. Actual and deflated series are plotted 
in Fig. 61. The degree to which changing monetary values 
distorted the construction series may be readily appreciated 
from the diagram. 

Most value series are affected by price changes, and it is 
generally advisable to correct for this factor before further 
analysis is attempted. Each case presents a new problem, 
for no general deflating index is suitable to all series. The 
index of wholesale prices compiled by the United States 
Bureau of Labor Statistics has been used extensively in 
deflating economic data expressed in dollar values, but this 
index is not at all appropriate in many of the cases in 
which it has been employed. It is absurd, for instance, to 
deflate money wages by an index of wholesale prices. The 
deflating index employed should be a measure of price 
changes as they affect the series being deflated. 

The deflation of a value series is in general a first step 
in the study of that series. The way is then open for further 
analysis by methods explained in the present and succeeding 
chapters. 
THE ANALYSIS OF TIME SERIES: MEASUREMENT
OF SEASONAL AND CYCLICAL FLUCTUATIONS

The measurement of secular trend is but one of the 
problems connected with the analysis of a series in time. 
Such series, it has been pointed out, are subject to periodic 
fluctuations, seasonal and cyclical in character, and these 
fluctuations are generally more important in their effects 
upon business than is the long-time trend. Our present 
concern is with methods of isolating such periodic variations. 
The series in Table 75, which clearly reflects the seasonal
and cyclical swings of domestic trade in the United States, 
may be used to illustrate methods of measuring these 
movements. 

Table 75

Average Weekly Freight Car Loadings in the United States,
1918-1927 *

(Unit: 1,000 cars)

Month        1918   1919   1920   1921   1922   1923   1924   1925   1926   1927

January       655    728    820    706    696    848    859    891    920    944
February      753    687    776    685    757    854    908    906    932    956
March         842    696    848    691    818    916    916    926    960    998
April         873    721    730    706    716    941    874    932    966    969
May           897    759    862    760    776    975    895    971  1,018  1,004
June          918    796    896    762    831  1,012    906    992  1,052  1,021
July          970    887    901    750    813    985    881    975  1,037    978
August        962    892    969    810    853  1,042    969  1,073  1,106  1,073
September     956    960    967    842    925  1,037  1,037  1,074  1,140  1,093
October       925    967  1,005    932    978  1,070  1,091  1,107  1,184  1,101
November      819    807    884    764    957    964    976  1,024  1,042    926
December      719    758    755    681    832    827    869    925    858    814

Average       857    805    868    757    829    956    932    983  1,018    990

* Data from the Annual Bulletin of the American Railway Association and
the Survey of Current Business. The published figures have been slightly
revised, to take account of calendar variations.
For the present purpose the study of seasonal and cyclical
variations in freight car loadings is limited to the period 
1918-1927. The disturbances of the ensuing period, com- 
bined with changes in railroad operating methods and busi- 
ness practices, materially modified the behavior of this series. 
The demonstration of statistical procedure will be clearer 
if restricted to the relatively homogeneous period here cov- 
ered. 

The Measurement of Seasonal Fluctuations: Moving Averages

Moving averages provide a useful method of defining 
seasonal variations. Since these fluctuations take place 
within a constant period of twelve months, a moving average 
may be used with more confidence than when a cycle of 
varying length is involved. The magnitude of the fluctuations
(the amplitude of the seasonal swings) will not ordinarily
be constant, hence the line marked out by the moving
averages will not be completely free of seasonal influences.
The relation of the actual monthly items to the moving 
averages may be averaged, however, and the indices of 
seasonal variation based upon these averages. 

It is essential, of course, that the moving average, cen- 
tered, fall at the same date as the original figure with 
which it is to be compared. This involves a second process 
of averaging. For example, the weekly averages of freight 
car loadings relate to the middle of each month. The 
average of the twelve monthly items for 1918, when centered, 
falls on July 1st. The average of the items from February, 
1918, through January, 1919, centered, falls on August 1st. 
To secure a figure comparable with the July 15th average, 
these two must be averaged. By this process of computing 
a two-month moving average from the twelve-month aver- 
age, comparability with the original figures is secured. 
Table 76 presents averages obtained in this way for the 
period from July, 1918, to June, 1927. 
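
The double averaging just described may be sketched as follows. This is an illustrative Python fragment, not a prescribed routine; the function name is our own. The figures are the monthly loadings from January, 1918, through February, 1919, with February, 1918, taken as 753, the value consistent with the annual average of 857 and with the entries of Table 80.

```python
def centered_moving_average(values, period=12):
    """Moving average of an even number of terms, re-centered on the
    original dates by averaging each pair of adjacent averages.

    Returns a dict mapping the position of each item in `values` to the
    centered average that falls at the same date.
    """
    half = period // 2
    result = {}
    for i in range(half, len(values) - half):
        first = sum(values[i - half:i + half]) / period           # centers half a month early
        second = sum(values[i - half + 1:i + half + 1]) / period  # centers half a month late
        result[i] = (first + second) / 2
    return result


# January 1918 (position 0) through February 1919 (position 13).
loadings = [655, 753, 842, 873, 897, 918, 970, 962, 956, 925, 819, 719,
            728, 687]

ma = centered_moving_average(loadings)
print(round(ma[6], 2))  # July 1918
print(round(ma[7], 2))  # August 1918
```

Position 6 is July, 1918, and position 7 is August, 1918; the two results correspond to the first two entries of Table 76. Six months are necessarily lost at each end of the series.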
Table 76

Moving Averages of Freight Car Loadings, 1918-1927

(12-month moving average, centered, adjusted by 2-month moving average,
centered)

(Unit: 1,000 cars)

Month     1918    1919    1920    1921    1922    1923    1924    1925     1926     1927

Jan.      ....   808.0   850.8   809.6   783.7   915.8   935.9   957.3  1,004.8  1,019.1
Feb.      ....   801.7   854.6   796.7   788.1   930.9   928.5   965.6  1,008.7  1,015.3
Mar.      ....   798.9   858.1   784.9   793.4   943.4   925.5   971.5  1,012.8  1,012.0
Apr.      ....   800.8   860.0   776.6   798.8   951.9   926.4   973.7  1,018.8  1,006.5
May       ....   802.1   864.8   768.6   808.7   956.0   927.8   976.3  1,022.8    998.3
June      ....   803.2   867.9   760.5   823.0   956.1   930.0   980.7  1,020.7    991.6
July     860.5   808.7   863.0   757.0   835.7   956.4   933.1   984.2  1,018.9     ....
Aug.     860.8   816.2   854.5   759.6   846.0   959.1   934.3   986.5  1,020.9     ....
Sept.    851.9   826.3   844.1   767.9   854.2   961.3   934.7   989.0  1,023.6     ....
Oct.     839.5   833.0   836.6   773.6   867.6   958.5   937.5   991.8  1,025.2     ....
Nov.     827.4   837.6   831.3   774.7   885.3   952.4   943.1   995.2  1,024.8     ....
Dec.     816.6   846.1   821.5   778.2   901.1   944.7   949.8   999.7  1,022.9     ....


The original data are now expressed as percentages of 
the corresponding moving averages. These percentages 
are given in Table 77. 


Table 77

Percentage Relation of Actual Freight Car Loadings to 12-Month
Moving Averages

Month     1918    1919    1920    1921    1922    1923    1924    1925    1926    1927

Jan.      ....    90.1    96.4    87.2    88.8    92.6    91.8    93.1    91.6    92.6
Feb.      ....    85.7    90.8    86.0    96.1    91.7    97.8    93.8    92.4    94.2
Mar.      ....    87.1    98.8    88.0   103.1    97.1    99.0    95.3    94.8    98.6
Apr.      ....    90.0    84.9    90.9    89.6    98.9    94.3    95.7    94.8    96.3
May       ....    94.6    99.7    98.9    96.0   102.0    96.5    99.5    99.5   100.6
June      ....    99.1   103.2   100.2   101.0   105.8    97.4   101.2   103.1   103.0
July     112.7   109.7   104.4    99.1    97.3   103.0    94.4    99.1   101.8    ....
Aug.     111.8   109.3   113.4   106.6   100.8   108.6   103.7   108.8   108.3    ....
Sept.    112.2   116.2   114.6   109.6   108.3   107.9   110.9   108.6   111.4    ....
Oct.     110.2   116.1   120.1   120.5   112.7   111.6   116.4   111.6   115.5    ....
Nov.      99.0    96.3   106.3    98.6   108.1   101.2   103.5   102.9   101.7    ....
Dec.      88.0    89.6    91.9    87.5    92.3    87.5    91.5    92.5    83.9    ....


These percentages show some variation from year to 
year in the relation of the figures for a given month to 
the moving average. Thus the January figures, while 
always below the average, vary from 87.2 per cent to 
96.4 per cent of the average. The nine percentages secured
for each month must be averaged to obtain the index
desired. Either the arithmetic average or the median may
be employed for this purpose. The results secured by
applying the two methods are shown in Table 78. In
columns (2) and (3) the actual arithmetic means and
medians are given. The average of the twelve arithmetic
means happens to be exactly 100, so no further adjustment
is needed. In general the average will differ in some degree
from 100, as it does for the medians. When this is the
case, the twelve monthly index numbers must be adjusted
to make their average equal to 100. The items in column
(4) have been secured from the items in column (3) by
dividing throughout by 1.00367.
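
The averaging and the final adjustment may be sketched in Python. The fragment is illustrative only, and `adjust_to_100` is a name of our own choosing. The January percentages are read from Table 77; the twelve unadjusted medians are those of column (3) of Table 78.

```python
from statistics import mean, median


def adjust_to_100(indices):
    """Scale twelve monthly indices so that their average is exactly 100."""
    factor = mean(indices) / 100
    return [x / factor for x in indices]


# The nine January percentages, 1919-1927.
january = [90.1, 96.4, 87.2, 88.8, 92.6, 91.8, 93.1, 91.6, 92.6]
jan_mean = mean(january)
jan_median = median(january)

# Unadjusted medians, January through December.
unadjusted_medians = [91.8, 92.4, 97.1, 94.3, 99.5, 101.2,
                      101.8, 108.6, 110.9, 115.5, 101.7, 89.6]
adjusted = adjust_to_100(unadjusted_medians)

print(round(jan_mean, 1), jan_median)
print(round(mean(unadjusted_medians) / 100, 5))
print([round(x, 1) for x in adjusted])
```

The divisor printed on the second line is the 1.00367 of the text (to five decimals), and the adjusted values correspond to column (4) of Table 78.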


Table 78

Indices of Seasonal Variation in Freight Car Loadings, Computed
from Moving Averages

(1)           (2)           (3)             (4)
Month         Arithmetic    Medians         Medians
              means         (unadjusted)    (adjusted)

January          91.6           91.8           91.5
February         92.1           92.4           92.1
March            95.8           97.1           96.7
April            92.8           94.3           94.0
May              98.6           99.5           99.1
June            101.6          101.2          100.8
July            102.4          101.8          101.4
August          107.9          108.6          108.2
September       111.1          110.9          110.5
October         115.0          115.5          115.1
November        101.7          101.7          101.3
December         89.4           89.6           89.3

Average         100.0          100.367        100.0



The Computation of Index Numbers of Seasonal
Variation by Averaging Ratios to Trend

Another method of securing seasonal indices, one possessing
distinctive advantages, involves the averaging of ratios
to trend.¹ In the application of this method, a suitable
line of trend, linear or non-linear, is fitted to the data,
the actual monthly items are expressed as percentages of
the corresponding trend figures, and then, for each month,
an average of the percentage ratios of actual to trend
values is secured. This procedure is identical with that
described in connection with the use of moving averages,
except that the actual values may be expressed as
percentages of normal values derived from any function
employed to represent trend. In the selection of an average
value for each month, use may be made of a multiple
frequency table in obtaining an understanding of the nature
of the actual seasonal movement. With the help of such a
table the existence of a definite seasonal movement may be
verified and the type of average to be used in securing a
typical value for each month may be determined. (It would,
of course, be equally appropriate to use a table of this
type in connection with the method of moving averages.) We
shall apply this method to the data employed in the
preceding examples.

A straight line, fitted to annual averages of the data of
freight car loadings from 1918 to 1927, as given in Table 75,
is described by the equation

y = 769.00 + 23.727x

with origin at July 1, 1917. Normal values for each month
may be computed readily.² The normal value for the
month centering at July 1, 1917, is 769.00 (i.e., the constant
a of the trend equation). Since the increment over a
twelve-month period is 23.727, the increment from month to
month is one twelfth of this, or 1.977. Hence the normal
value for the month centering at January 1, 1918, is
769.00 + (6 × 1.977), or 780.862. But the average weekly
freight car loadings for January, 1918, must be taken to
center at January 15th. The monthly trend value centering
at that date is 780.862 + ½(1.977), or 781.850. The trend
value for February, 1918, is secured by adding to 781.850
the monthly increment, 1.977. A similar process gives
the value for each succeeding month. The results, rounded
off to the nearest whole number, are given in column (2)
of Table 80.

¹ The essentials of this method were worked out independently by Helen D.
[...], “[...] Variation,” Journal of the American Statistical Association,
[...] 1924, 167-179, and Lincoln W. Hall, “Seasonal [...],” Journal of the
American Statistical Association, June, 1924.

² [...] determination of monthly trend values are discussed in Chapter VII.

Fig. 62. Frequency Distributions: Monthly Freight Car Loadings
Expressed as Relatives of Corresponding Trend Values
(A multiple frequency table: for each month, January to December, the
relatives are tallied in classes of two points running from 76-77.9 to
122-123.9.)
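
The computation of monthly trend values from the annual equation may be sketched as below. The fragment is illustrative; the function name `trend_value` is our own. It assumes, as the text does, that each monthly figure centers at the middle of its month and that the origin of the equation is July 1, 1917. The monthly increment is carried unrounded (23.727/12) rather than as 1.977, so that results may differ from hand computation in the third decimal place.

```python
A = 769.00        # constant term of the trend equation (value at July 1, 1917)
B = 23.727        # annual increment
MONTHLY_STEP = B / 12


def trend_value(year, month):
    """Trend value for the middle of the given month (month: 1 = January)."""
    # Whole months elapsed from July 1, 1917, to the first of the month,
    # plus half a month to reach mid-month.
    months_from_origin = (year - 1917) * 12 + (month - 7) + 0.5
    return A + MONTHLY_STEP * months_from_origin


print(round(trend_value(1918, 1), 3))   # January 1918
print(round(trend_value(1918, 2)))      # February 1918
print(round(trend_value(1927, 12)))     # December 1927
```

Rounded to whole numbers, these values are the 782, 784 and 1,017 entered for the same months in column (2) of Table 80.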

Expressing each of the given values for each month as a
percentage of the corresponding trend value, we secure ten
such relative figures (since the data cover ten years). The
ten January percentages vary from 79.4 to 98.9, the ten
October percentages from 107.0 to 119.7, etc. The multiple
frequency table which appears in Fig. 62 is constructed
by classifying, in the form of a frequency distribution, the
items for each month.

The presence of a distinct seasonal variation is
demonstrated by this table. Freight traffic is consistently
low in the winter months. Activity is somewhat greater
in the spring, and reaches a peak as a result of harvesting
and other demands in the late summer and fall.

The tabular summary facilitates the selection of a type
of average for the measurement of the seasonal movements.
The median is likely to be unrepresentative; it is subject
to material change in value by the addition or withdrawal
of one or two entries, unless there is a definite concentration
in the monthly frequency distributions. The arithmetic
mean of all the items, on the other hand, may be unduly
affected by exceptional cases. An alternative method is
provided by the possibility of taking the arithmetic mean
of the central items for each month. If an inspection of
the multiple frequency table does not lead to an immediate
decision as to which is the best type of average to employ
in a given case, several index numbers may be worked
out and a decision reached after a comparison [...]
determination of a [...] a separate problem for each
month, the [...]. In the present instance the seasonal
variation is fairly regular, year after year. No great
differences would appear in the results secured by averaging
varying numbers of items. Index numbers [...] of the four
central items for each of the twelve months [...]. (In
general, an average of [...] to be stable [...] the greater
the concentration [...] tables, the smaller the number
of items upon which the index numbers may be based.)
The simple averages of the four central items constitute
the unadjusted index numbers given in Table 79. Correcting
these so that the average of each group is equal to 100,
we secure the adjusted index numbers presented in the
same table. (These averages have been derived, not from
the frequency distributions shown in Fig. 62, but from
individual percentages defining the relation of actual to
trend values.)
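
The averaging of central items may be sketched for a single month. The fragment below is an illustration under stated assumptions: the function name is our own, the ten January loadings are those of Table 75, and the January trend values are computed from the trend equation (781.850 for January, 1918, increasing by 23.727 each year) rather than taken in rounded form.

```python
def mean_of_central_items(values, keep=4):
    """Arithmetic mean of the `keep` central items after ranking,
    dropping an equal number of items from each end."""
    ordered = sorted(values)
    drop = (len(ordered) - keep) // 2
    central = ordered[drop:drop + keep]
    return sum(central) / len(central)


# January loadings, 1918-1927, and the corresponding mid-January trend values.
january_actual = [655, 728, 820, 706, 696, 848, 859, 891, 920, 944]
january_trend = [781.850 + 23.727 * n for n in range(10)]

# Percentage ratios of actual to trend.
relatives = [100 * a / t for a, t in zip(january_actual, january_trend)]

january_index = mean_of_central_items(relatives, keep=4)
print(round(min(relatives), 1), round(max(relatives), 1))
print(round(january_index, 1))
```

The extreme relatives agree with the range of 79.4 to 98.9 quoted earlier, and the mean of the four central items corresponds to the unadjusted January index of Table 79. Dropping the three highest and three lowest items protects the index against exceptional years while still resting it on more than a single observation.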

Table 79

Indices of Seasonal Variation in Freight Car Loadings, Based
upon Percentage Ratios of Actual Values to Linear
Trend Values

              Unadjusted           Adjusted
              index numbers        index numbers
Month         (based upon four     (based upon four
              central items)       central items)

January            92.9                 91.6
February           94.8                 93.5
March              98.6                 97.2
April              94.3                 93.0
May               100.2                 98.8
June              102.4                101.0
July              102.8                101.3
August            109.7                108.2
September         112.3                110.7
October           115.6                114.0
November          103.6                102.1
December           89.9                 88.6

Average           101.425              100.0



The index numbers of seasonal variation derived from 
ratios to trend accord very closely with those computed 
from moving averages. The widest discrepancy, for the 
month of February, amounts to only 1.4. The consistency 
of the seasonal movement in freight car loadings helps 
to explain this close agreement. In general the two methods 
here exemplified will yield results that are fairly close 
together. Both are well adapted to the measurement of 
seasonal changes in homogeneous series. Simpler methods 
may be used on occasion, and more involved methods may 
be required in dealing with non-homogeneous series where
there is reason to suspect that the pattern of seasonal 
movements has been modified during the period under 
observation. 

Modifications of these general procedures are necessary 
when the pattern of the seasonal movements in a given
series is altered during the period under observation. Two
types of shifts in seasonal variation may be distinguished.
The first includes shifts that are irregular over time, but 
that are related to definable causal factors. Thus the price 
of an agricultural product may follow one seasonal pattern 
in years of high production, and quite a different pattern 
in years of low production.* Where this condition prevails 
it may be possible to compute two sets of seasonal indices, 
each to be applied under appropriate conditions. Methods 
already described may be used in the construction of such 
indices. Of this irregular type, also, are alterations in 
the seasonal pattern of an economic series that reflect sharp 
changes in business practice. Shifts in the dates of the 
annual automobile shows in the United States have mater- 
ially altered the seasonal index of automobile sales. 

The second type of seasonal modification is progressive 
in character. The change in pattern is not sudden, nor 
does it reverse itself. Slow alterations over time in trade 
practices and consumption habits bring such evolutionary 
or secular changes. The slow displacement of the open 
car by the closed car brought such a progressive modification
in the seasonal pattern of automobile sales. In the computa- 
tion of seasonal indices under these conditions persistent 
changes over time in the figures for each month may be 
measured separately. Thus, when ratios to trend have 
been obtained, all the January items (such as those plotted 
in Fig. 62) may be plotted chronologically. The progressive 
change in the January relatives from 1920 to 1937, say, 
is then defined by a line of secular trend. The trend value 

* See F. L. Thompson, Agricultural Prices, New York, McGraw-Hill, 1936.
for January of 1920 is a first approximation to the January
seasonal index for 1920. The figure for February of 1920 
is obtained in the same way, and so for each month of 
1920. Adjustment of these preliminary values to make 
their average equal to 100 gives a set of seasonal indices 
for 1920. Seasonal indices for other years are computed 
in the same way. 

This method is, of course, more laborious than the pro- 
cedure followed when the seasonal pattern remains con- 
stant. Before applying the more complicated method the 
investigator should assure himself that the shift in pat- 
tern is real, and not merely a reflection of accidental varia- 
tions.* 

The Measurement of Cyclical Fluctuations

There remains the task of combining the corrections for 
secular trend and seasonal variation in order to secure 
measures of cyclical changes in a given series. Major 
interest in most economic studies attaches to these cyclical 
changes, and the measurement of such changes is usually 
the central problem in the analysis of time series. The 
complete elimination of all non-cyclical movements is impos- 
sible, of course. We must content ourselves with measures 
reflecting cyclical changes intermingled in rather uncertain 
proportions with accidental fluctuations. 

The procedure may be illustrated with reference to the 
data of freight car loadings in the United States, presented 
in Table 75. For the purposes of the present illustration 
the study will be restricted to the decade 1918-1927. The 

* Tests of sampling errors are discussed in Chapters XIV, XV, and XVIII.
The test of a linear trend in this case would relate to the slope b of the line
fitted to the relatives for a given month.

The literature on the measurement of seasonal fluctuations is extensive.
The references at the close of this chapter contain detailed accounts of various
modifications of the basic procedures discussed above. A rapid, flexible
and accurate graphic method, suitable for use by the student who has grasped
the essentials of the formal procedures, is explained in the article by William A.
Spurr. Spurr’s method utilizes relative (logarithmic) deviations, a procedure
for which there is strong logical justification.
severe disturbances that occurred during the business cycle
that ran its course between 1927 and 1933, and in the years
immediately following, greatly complicate the task of
disentangling the secular, seasonal, and cyclical elements in
the behavior of this series. Not until a somewhat longer period
has intervened will it be possible to determine the
contributions a changing secular trend and changing seasonal
movements may have made to the fluctuations in railway freight
traffic during the decade 1927-1937.

In attempting to separate the results of secular, seasonal,
cyclical, and random movements in the behavior of time
series, it is well to establish a series of “expected” values
representing results of the operation of regularly acting
forces. Most regular and predictable of the forces affecting
time series are those defined as secular and seasonal. The
equation to the line of secular trend of freight car loadings
provides a means of estimating annual and monthly values.
These would be the “expected” values were the forces of
trend alone in operation. But we know that a seasonal
movement, regular enough for fairly exact measurement, is
superimposed upon the trend. The combination of the
results of these two forces provides a basic series of “expected”
values, from which deviations due to the play of other
forces may conveniently be measured.

The procedure for this purpose is illustrated in Table 80.
In col. (2) we have the monthly trend values of freight
car loadings, and in col. (3) the indices of seasonal
variation. The products of the two, constituting the series
of “expected values,” are given in col. (4). Thus, for
January, 1918, the expected number of freight cars loaded is
not 782, the trend value, but 782 × .916, the latter figure
being the seasonal index for January. This correction
gives an “expected” number of 716. Subtracting from the
actual values in col. (5) the corresponding expected values,
we obtain the measurements in col. (6). The 655 cars
loaded in January, 1918, fell short by 61 of the “expected”
number, 716. Such deviations of actual values from “trend
corrected for seasonal” represent the combined influence
of cyclical and accidental factors. These may be utilized
in the absolute form given in col. (6), or may be expressed
in percentage terms as in col. (7) of Table 80.

Table 80

Illustrating the Analysis of a Series in Time
Freight Car Loadings, 1918-1927

(1)      (2)      (3)         (4)            (5)      (6)            (7)
Month    Trend    Seasonal    Trend          Actual   Deviation of   Percentage
                  index       corrected               actual from    deviation
                  (as ratio)  for seasonal            ‘trend         (A - TS)/TS
                                                      corrected for
                                                      seasonal’
         T        S           TS             A        A - TS

1918
Jan.      782      .916        716            655      - 61         -  8.5
Feb.      784      .935        733            753      + 20         +  2.7
Mar.      786      .972        764            842      + 78         + 10.2
Apr.      788      .930        733            873      +140         + 19.1
May       790      .988        781            897      +116         + 14.9
June      792     1.010        800            918      +118         + 14.8
July      794     1.013        804            970      +166         + 20.6
Aug.      796     1.082        861            962      +101         + 11.7
Sept.     798     1.107        883            956      + 73         +  8.3
Oct.      800     1.140        912            925      + 13         +  1.4
Nov.      802     1.021        819            819         0            0
Dec.      804      .886        712            719      +  7         +  0.98
1919
Jan.      806      .916        738            728      - 10         -  1.4
Feb.      808      .935        755            687      - 68         -  9.0
Mar.      810      .972        787            696      - 91         - 11.6
Apr.      812      .930        755            721      - 34         -  4.5
May       813      .988        803            759      - 44         -  5.5
June      815     1.010        823            796      - 27         -  3.3
July      817     1.013        828            887      + 59         +  7.1
Aug.      819     1.082        886            892      +  6         +  0.7
Sept.     821     1.107        909            960      + 51         +  5.6
Oct.      823     1.140        938            967      + 29         +  3.1
Nov.      825     1.021        842            807      - 35         -  4.2
Dec.      827      .886        733            758      + 25         +  3.4
1920
Jan.      829      .916        759            820      + 61         +  8.0
Feb.      831      .935        777            776      -  1         -  0.1
Mar.      833      .972        810            848      + 38         +  4.7
Apr.      835      .930        777            730      - 47         -  6.0
May       837      .988        827            862      + 35         +  4.2
June      839     1.010        847            896      + 49         +  5.8
July      841     1.013        852            901      + 49         +  5.8
Aug.      843     1.082        912            969      + 57         +  6.3
Sept.     845     1.107        935            967      + 32         +  3.4
Oct.      847     1.140        966          1,005      + 39         +  4.0
Nov.      849     1.021        867            884      + 17         +  2.0
Dec.      851      .886        754            755      +  1         +  0.1
1921
Jan.      853      .916        781            706      - 75         -  9.6
Feb.      855      .935        799            685      -114         - 14.3
Mar.      857      .972        833            691      -142         - 17.0
Apr.      859      .930        799            706      - 93         - 11.6
May       861      .988        851            760      - 91         - 10.7
June      863     1.010        872            762      -110         - 12.6
July      865     1.013        876            750      -126         - 14.4
Aug.      867     1.082        938            810      -128         - 13.6
Sept.     869     1.107        962            842      -120         - 12.5
Oct.      871     1.140        993            932      - 61         -  6.1
Nov.      873     1.021        891            764      -127         - 14.3
Dec.      875      .886        775            681      - 94         - 12.1
1922
Jan.      877      .916        803            696      -107         - 13.3
Feb.      879      .935        822            757      - 65         -  7.9
Mar.      881      .972        856            818      - 38         -  4.4
Apr.      883      .930        821            716      -105         - 12.8
May       885      .988        874            776      - 98         - 11.2
June      887     1.010        896            831      - 65         -  7.3
July      889     1.013        901            813      - 88         -  9.8
Aug.      891     1.082        964            853      -111         - 11.5
Sept.     893     1.107        989            925      - 64         -  6.5
Oct.      895     1.140      1,020            978      - 42         -  4.1
Nov.      897     1.021        916            957      + 41         +  4.5
Dec.      899      .886        797            832      + 35         +  4.4
1923
Jan.      900      .916        824            848      + 24         +  2.9
Feb.      902      .935        843            854      + 11         +  1.3
Mar.      904      .972        879            916      + 37         +  4.2
Apr.      906      .930        843            941      + 98         + 11.6
May       908      .988        897            975      + 78         +  8.7
June      910     1.010        919          1,012      + 93         + 10.1
July      912     1.013        924            985      + 61         +  6.6
Aug.      914     1.082        989          1,042      + 53         +  5.4
Sept.     916     1.107      1,014          1,037      + 23         +  2.3
Oct.      918     1.140      1,047          1,070      + 23         +  2.2
Nov.      920     1.021        939            964      + 25         +  2.7
Dec.      922      .886        817            827      + 10         +  1.2
1924
Jan.      924      .916        846            859      + 13         +  1.5
Feb.      926      .935        866            908      + 42         +  4.8
Mar.      928      .972        902            916      + 14         +  1.6
Apr.      930      .930        865            874      +  9         +  1.0
May       932      .988        921            895      - 26         -  2.8
June      934     1.010        943            906      - 37         -  3.9
July      936     1.013        948            881      - 67         -  7.1
Aug.      938     1.082      1,015            969      - 46         -  4.5
Sept.     940     1.107      1,041          1,037      -  4         -  0.4
Oct.      942     1.140      1,074          1,091      + 17         +  1.6
Nov.      944     1.021        964            976      + 12         +  1.2
Dec.      946      .886        838            869      + 31         +  3.7
1925
Jan.      948      .916        868            891      + 23         +  2.6
Feb.      950      .935        888            906      + 18         +  2.0
Mar.      952      .972        925            926      +  1         +  0.1
Apr.      954      .930        887            932      + 45         +  5.1
May       956      .988        945            971      + 26         +  2.8
June      958     1.010        968            992      + 24         +  2.5
July      960     1.013        972            975      +  3         +  0.3
Aug.      962     1.082      1,041          1,073      + 32         +  3.1
Sept.     964     1.107      1,067          1,074      +  7         +  0.7
Oct.      966     1.140      1,101          1,107      +  6         +  0.5
Nov.      968     1.021        988          1,024      + 36         +  3.6
Dec.      970      .886        859            925      + 66         +  7.7
1926
Jan.      972      .916        890            920      + 30         +  3.4
Feb.      974      .935        911            932      + 21         +  2.3
Mar.      976      .972        949            960      + 11         +  1.2
Apr.      978      .930        910            966      + 56         +  6.2
May       980      .988        968          1,018      + 50         +  5.2
June      982     1.010        992          1,052      + 60         +  6.0
July      984     1.013        997          1,037      + 40         +  4.0
Aug.      986     1.082      1,067          1,106      + 39         +  3.7
Sept.     987     1.107      1,093          1,140      + 47         +  4.3
Oct.      989     1.140      1,127          1,184      + 57         +  5.1
Nov.      991     1.021      1,012          1,042      + 30         +  3.0
Dec.      993      .886        880            858      - 22         -  2.5
1927
Jan.      995      .916        911            944      + 33         +  3.6
Feb.      997      .935        932            956      + 24         +  2.6
Mar.      999      .972        971            998      + 27         +  2.8
Apr.    1,001      .930        931            969      + 38         +  4.1
May     1,003      .988        991          1,004      + 13         +  1.3
June    1,005     1.010      1,015          1,021      +  6         +  0.6
July    1,007     1.013      1,020            978      - 42         -  4.1
Aug.    1,009     1.082      1,092          1,073      - 19         -  1.7
Sept.   1,011     1.107      1,119          1,093      - 26         -  2.3
Oct.    1,013     1.140      1,155          1,101      - 54         -  4.7
Nov.    1,015     1.021      1,036            926      -110         - 10.6
Dec.    1,017      .886        901            814      - 87         -  9.7
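
The derivation of cols. (4), (6) and (7) from cols. (2), (3) and (5) may be sketched as follows. The fragment is illustrative; the function name is our own, and the rounding of the expected value to a whole number before subtraction is an assumption made to match the worked January, 1918, example in the text.

```python
def cyclical_deviation(actual, trend, seasonal):
    """Return (expected, deviation, percentage deviation) for one month.

    expected  = trend corrected for seasonal, T x S, rounded to a whole number
    deviation = A - TS
    percent   = 100 (A - TS) / TS
    """
    expected = round(trend * seasonal)
    deviation = actual - expected
    percent = 100 * deviation / expected
    return expected, deviation, percent


# January 1918: trend 782, seasonal index .916, actual loadings 655.
expected, deviation, percent = cyclical_deviation(655, 782, 0.916)
print(expected, deviation, round(percent, 1))

# October 1922: trend 895, seasonal index 1.140, actual loadings 978.
print(cyclical_deviation(978, 895, 1.140)[:2])
```

The first line reproduces the figures of the text: an expected value of 716 and a shortfall of 61, or 8.5 per cent.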

The series defining deviations from trend values corrected for
seasonal variations, which are given in cols. (6) and (7) of
Table 80, furnish the most satisfactory bases from which cycles
in economic series may be measured. It is true that the
“cycles” in cols. (6) and (7) are distorted by accidental
fluctuations, but there is no simple means by which these
may be eliminated. Recognizing their presence, the series
may be put to fruitful use in the study of cyclical movements.¹

¹ [...] “deviations from trend” may be secured by subtracting
indices of seasonal variation from a series in which actual values are
The analysis of this series may be followed through
graphically in Figs. 63 and 64 on page 300. The actual data of 
freight car loadings, by months from 1918 to 1927, are plotted 
in Fig. 63, together with a straight line of trend. In 
addition, a series of expected values (the figures in col. [4] 
of Table 80) is given for comparison with the actual. In 
this chart the seasonal pattern, shown by the dotted line, is 
superimposed upon the trend. Fig. 64 shows the deviations 
of actual from expected values, in percentage terms. These 
constitute the “cycles” in freight car loadings. As we have 
noted, random elements as well as cyclical fluctuations proper 
are present in these deviations. It would be possible, by 
using three- or five-month moving averages on these devia- 
tions, or by other smoothing processes, to eliminate some of 
the effects of the accidental movements. But the random and 
the cyclical movements are so closely interwoven that the 
attempt at separation is not generally made. 

If cyclical changes in this series are to be compared with 
similar changes in other series, it is desirable to reduce the fig- 
ures to a form permitting such comparison. The percentage 
deviations might be much more violent in one series than in 
another, and without a common denominator comparison 
would be difficult. This common denominator is afforded by 
the standard deviation. The monthly or annual deviations 
may be expressed in terms of the standard deviation as the 
unit of measurement, if such comparison is to be made. 
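
The reduction to standard deviation units may be sketched as follows. The two series are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption; the text does not specify the exact form of the computation.

```python
from statistics import pstdev


def in_standard_units(deviations):
    """Divide each deviation by the standard deviation of its own series."""
    sigma = pstdev(deviations)  # population standard deviation (an assumption)
    return [d / sigma for d in deviations]


# Hypothetical percentage deviations: one violent series, one mild series.
violent = [10.0, -10.0, 20.0, -20.0]
mild = [1.0, -1.0, 2.0, -2.0]

violent_units = in_standard_units(violent)
mild_units = in_standard_units(mild)
print([round(u, 2) for u in violent_units])
print([round(u, 2) for u in mild_units])
```

The second series is one tenth as violent as the first, yet the two are identical once each is expressed in units of its own standard deviation. This is the sense in which the standard deviation serves as a common denominator.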


(Footnote 1 continued from page 298.) 

given as percentages of corresponding trend values. That is, \(\frac{A}{T} - S\) is 
employed, instead of \(\frac{A - TS}{TS}\). This usage, which involves the assumptions 
that the cyclical-accidental composite and seasonal variations both repre- 
sent deviations from trend as base and that their influences are additive, is 
not as strong, logically, as the method exemplified in the text. Trend and 
seasonal forces are the constant factors in the behavior of time series. In 
combination they may be thought of as providing the base from which cycli- 
cal and accidental movements occur, as deviations. (This is a convenient, and 
perhaps not a faulty, conception. We do not, however, possess knowledge of 
the true organic relations among the elements of time series.) 
The process of analysis has now been completed. We 
have, for the given series, the equation to the line of secular 
trend, and from this the normal or trend value at any 
given date may be computed. The seasonal variations have 
been measured, and indices of these variations computed. 
Finally, the cyclical fluctuations (plus the unmeasurable 
random and accidental changes) have been isolated. These 
measurements of cyclical fluctuations may be used in 
studying the sequences of change in different economic 
series during business cycles, in comparing economic series 
in respect of the amplitude or duration of their cyclical 
movements, and in various other ways in the analysis of 
business cycles and the planning of business operations. 
Some of these applications are discussed in later sections. 

General Considerations 

Certain considerations not specifically mentioned above 
should be borne in mind in subjecting time series to the 
type of analysis described in this chapter. It is essential 
that the data employed be homogeneous, as regards sources, 
methods of quotation, coverage, etc. In addition, homo- 
geneity in the conditions underlying the behavior of the 
particular series which are the objects of study is assumed. 
Homogeneity, as the term is here used, may not be defined 
in absolute terms. New factors are constantly being inter- 
[...] Homogeneity [...] Yet the change must 
[...] reasonably continuous if the [...] is to yield [...] 
results. Abrupt dislocations that suddenly alter prevailing 
[...] 

For the determination of a line of trend and the calcula- 
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tion of indices of seasonal variation, data extending over 
as long a period as possible should be employed (subject 
to the preceding qualification regarding fundamental dis- 
continuities). Ten years may be suggested as a minimum 
period, though a much longer term of years is desirable. 
If interest attaches to cycles of long duration, rather than 
to the short-period business cycles with which the preceding 
account is concerned, our concept of trend, as well as that 
of cycles, must be modified. The minimum time period 
suitable for study must be correspondingly lengthened. 

If a relatively short term of years is employed in the 
determination of trend, it is important that the terminal 
years be neither exceptionally high nor exceptionally low, 
as a result of cyclical or accidental movements. In general, 
the cyclical movements in the terminal years should be 
in “symmetrical phases,” in Crum’s phrase. Thus a cyclical 
rise at the beginning of the period should be balanced by 
a cyclical decline at the end. 

It is logically improper to make correction for assumed 
seasonal movements in a time series unless the existence 
of true seasonal variations has been established. That is, 
a test should be applied to determine whether the observed 
departures of the various monthly indices from their aver- 
age value (100) are attributable to the play of chance, or 
whether a true seasonal pattern is present. The basis of 
such tests of significance is discussed in Chapters XIV 
and XVIII, and a method appropriate to the present problem 
is developed in Chapter XV. 

In fitting a line of trend, computing indices of seasonal 
variation and deriving, finally, a set of residual figures 
which are taken to reflect the cyclical fluctuations in an 
economic series we are, of course, abstracting from reality. 
As in all such abstractions, caution is necessary. Assump- 
tions implicit in the various steps taken are likely to be 
forgotten. Thus the “cycles” plotted as deviations in Fig. 64 
are distorted not only by the random and irregular fluctua- 
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tions to which attention has already been called. To the 
extent that the trend is inadequately or inaccurately defined 
by the particular function used, residual errors are present 
in the deviations. To the extent that seasonal movements 
are inaccurately measured by the seasonal indices employed, 
other residual errors are present. And if the trend is pro- 
jected beyond the period covered by the fitting process, 
or if seasonal indices are used for periods not included in 
their calculation, new sources of possible error are intro- 
duced. The “cycles” that appear so definite and clear-cut 
in our tables may contain more than traces of many non- 
cyclical elements. It is often desirable to employ methods 
of analysis that carry us far from the original observations, 
but the dangers of misinterpretation and error are multiplied 
as we abstract from the reality of economic processes and 
business operations. 

The methods of time series analysis described in this 
and the preceding chapter are adapted to a variety of 
economic and business purposes. But they do not constitute 
the only means of attack, in dealing with series ordered 
in time. Special problems may necessitate the use of some- 
what more elaborate procedures. For some purposes simpler 
methods will suffice. For other purposes it may be invalid 
to attempt to isolate and measure separately the influence 
of secular, seasonal, and cyclical forces. Economic science 
has yet to determine the precise nature of the interrelations 
among these categories of forces. In the light of this fact 
the discerning statistician will adapt his methods to the 
requirements of individual problems, as they arise. 
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CHAPTER IX 


INDEX NUMBERS OF PHYSICAL VOLUME 

Comprehensive and accurate records of physical pro- 
duction are of central importance to business interests, to 
government, and to economists. The appraisal of the mar- 
ket and the intelligent planning of production programs 
require knowledge of past production trends and present 
conditions. The credit policies of banking authorities and 
monetary policies of federal agencies are determined in 
good part with reference to the physical volume of goods 
being produced and marketed. The phases of business 
cycles are probably traced with more accuracy by produc- 
tion movements than by changes in any other economic 
element. The directions in which the productive efforts 
of an economy are being exerted are defined by records of 
the output of goods of different classes, such as capital 
goods and consumption goods. Changes in the course of 
years in the true standard of living of a nation must be 
measured in terms of the aggregate of physical goods pro- 
duced. 

The last twenty years have witnessed notable enlarge- 
ments of the scope and improvements in the accuracy of 
measurements of production in the United States. Efforts 
of federal agencies, private organizations, and trade associa- 
tions have combined to provide materially better statistics 
of output in agriculture, mining, and manufacture. More 
recently records of the volume of trade have been broadened 
and made more accurate. There are important gaps still, 
particularly as regards the output of finished, highly fabri- 
cated goods not easily enumerated in units of constant 
quality. But the statistics we have provide a 
reasonably accurate record of monthly and annual move- 
ments of production. 

Here, again, we face the problem of combining series 
relating to individual commodities. For scattered data on 
the output of oats, coal, gasoline, pig iron, automobiles, 
etc. do not define the general changes in output that are 
of interest to persons concerned with the larger aspects 
of economic change. He who would study the course of 
general production encounters a problem much like that 
presented to the student of general price movements. If 
the general trend of production is to be determined, or 
if the cyclical or seasonal swings of production are to be 
studied, the mass of individual figures must be reduced to 
the form of a single index, the significance of which may 
be easily comprehended. The present chapter deals with 
methods appropriate to the construction of such indices. 

Index Numbers of Production Unadjusted for Trend 
and Seasonal Movements 

An immediate and obvious obstacle to the combination 
of measures of output for different industries arises from 
differences in the units employed. Since bushels, tons, and 
gallons may not be added directly, the simple aggregative 
type of index is ruled out. One method of overcoming this 
difficulty is to reduce to relative terms the several output 
series that are to be combined. A relative number measuring 
the output of petroleum in 1936 as a percentage of output 
in 1922 may be averaged with similar relatives for bituminous 
and anthracite coal. The average may be a simple one, 
or the relatives for the several commodities may be weighted 
in proportion to the importance of the commodities in 
question. This procedure was illustrated in detail in the 
opening pages of Chapter VI. 
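
The averaging of relatives may be sketched as follows. The output figures and the weights are hypothetical, chosen only to make the arithmetic easy to follow; they are not the petroleum and coal figures for 1922 and 1936.

```python
def output_relative(given, base):
    """Output in the given year as a percentage of base-year output."""
    return 100.0 * given / base


def weighted_average_of_relatives(relatives, weights):
    """Average the relatives, each weighted by the importance of its commodity."""
    total_weight = sum(weights[name] for name in relatives)
    weighted_sum = sum(relatives[name] * weights[name] for name in relatives)
    return weighted_sum / total_weight


# Hypothetical outputs; the units differ from commodity to commodity.
base_output = {"petroleum": 500.0, "bituminous coal": 400.0, "anthracite coal": 50.0}
given_output = {"petroleum": 1000.0, "bituminous coal": 440.0, "anthracite coal": 45.0}
weights = {"petroleum": 5, "bituminous coal": 4, "anthracite coal": 1}  # hypothetical

relatives = {
    name: output_relative(given_output[name], base_output[name])
    for name in base_output
}
weighted_index = weighted_average_of_relatives(relatives, weights)
simple_index = sum(relatives.values()) / len(relatives)
print(round(weighted_index, 1), round(simple_index, 1))
```

Because each series is first reduced to a relative, the difference in physical units no longer stands in the way of combination.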

An alternative method is to employ an index of the 
weighted aggregative type, keeping quantities constant as 
between two periods being compared. In 1917, according 
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to the computations of the Price Section of the War Indus- 
tries Board, the total value of the output of 90 raw materials 
in the United States was 34,748 millions of dollars. This 
figure represents, of course, a value total of the type 
\(\sum(q_{1917}\,p_{1917})\), where \(q_{1917}\) represents the quantity of a given 
raw material produced in 1917 and \(p_{1917}\) represents the 
average price of that commodity in 1917. In 1918 both 
quantities and prices were different. If, however, we obtain 
another value aggregate using 1918 quantities and 1917 
prices we shall have a figure differing from that for 1917 
only in respect of the quantity factor. For the 90 raw mate- 
rials in question this total, which is represented by the 
expression \(\sum(q_{1918}\,p_{1917})\), amounted to $35,169,000,000. The 
totals for 1918 and 1917 are comparable, being both in 
dollar units. The difference between them measures the 
change in physical production between the two years. As 
an index of this change we have 


\[ I = \frac{\sum(q_{1918}\,p_{1917})}{\sum(q_{1917}\,p_{1917})} \times 100 = \frac{\$35{,}169}{\$34{,}748} \times 100 = 101.2 \]
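
The arithmetic of this index may be checked with a short sketch. The two value aggregates, in millions of dollars, are those quoted in the text; the function names and the two-commodity example are hypothetical.

```python
def value_aggregate(quantities, prices):
    """Sum of quantity times price over all commodities."""
    return sum(quantities[name] * prices[name] for name in quantities)


def aggregative_quantity_index(given_year_aggregate, base_year_aggregate):
    """Ratio of two value aggregates priced at the same prices, as a percentage."""
    return 100.0 * given_year_aggregate / base_year_aggregate


# Aggregates quoted in the text, in millions of dollars:
# 1918 quantities at 1917 prices, and 1917 quantities at 1917 prices.
index_1918 = aggregative_quantity_index(35169, 34748)
print(round(index_1918, 1))

# The same calculation built up from hypothetical commodity data.
prices_1917 = {"a": 10.0, "b": 1.0}
quantities_1917 = {"a": 2.0, "b": 3.0}
quantities_1918 = {"a": 3.0, "b": 3.0}
example_index = aggregative_quantity_index(
    value_aggregate(quantities_1918, prices_1917),
    value_aggregate(quantities_1917, prices_1917),
)
```

Since the same prices enter both aggregates, any difference between the two totals is attributable to quantities alone.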


This index will be recognized as one of the aggregative 
types discussed in Chapter VI, except that the p’s and the 
q’s are interchanged. When information concerning both 
quantities produced and average per unit prices is available, 
these aggregative indices, or the “ideal” index which is 
a combination of two such aggregative measures, may be 
employed for the measurement of quantity changes as well 
as for price changes. The “ideal” index, when used for 
this purpose, takes the form 


\[ I = \sqrt{\frac{\sum(q_1 p_0)}{\sum(q_0 p_0)} \times \frac{\sum(q_1 p_1)}{\sum(q_0 p_1)}} \]


where \(q_0\) and \(p_0\) represent the quantities and prices of the 
individual commodities in the base year, while \(q_1\) and \(p_1\) 
represent quantities and prices in the given year. The 
procedure in the computation of such an index is identical 
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with that employed in computing the “ideal” price index, 
with prices and quantities reversed. This formula may be 
modified, as was the corresponding price index, to 

\[ \frac{\sum(p_0 + p_1)\,q_1}{\sum(p_0 + p_1)\,q_0} \]

or to a form in which the p’s come from some intermediate 
year. In one form or another, the aggregative type of 
index is well adapted to the requirements of an index of 
physical volume.¹ 

Fig. 65. Changes in the Physical Volume of Manufacturing Production 
in the United States, 1914-1935. All Commodities, Capital Goods and 
Consumption Goods 
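
The “ideal” quantity index and its modified form, in which the sum of base-year and given-year prices serves as the weight, may be sketched as follows. The two commodities and all figures are hypothetical, and the function names are ours.

```python
from math import sqrt


def aggregate(quantities, prices):
    """Sum of quantity times price over all commodities."""
    return sum(quantities[name] * prices[name] for name in quantities)


def ideal_quantity_index(q0, q1, p0, p1):
    """Geometric mean of the base-weighted and given-weighted aggregatives."""
    base_weighted = aggregate(q1, p0) / aggregate(q0, p0)
    given_weighted = aggregate(q1, p1) / aggregate(q0, p1)
    return 100.0 * sqrt(base_weighted * given_weighted)


def modified_quantity_index(q0, q1, p0, p1):
    """Aggregative index weighted by the sum of base and given-year prices."""
    combined = {name: p0[name] + p1[name] for name in p0}
    return 100.0 * aggregate(q1, combined) / aggregate(q0, combined)


# Hypothetical quantities and prices for two commodities.
q0 = {"coal": 10.0, "oil": 20.0}
q1 = {"coal": 12.0, "oil": 30.0}
p0 = {"coal": 2.0, "oil": 1.0}
p1 = {"coal": 3.0, "oil": 1.0}

ideal = ideal_quantity_index(q0, q1, p0, p1)
modified = modified_quantity_index(q0, q1, p0, p1)
print(round(ideal, 1), round(modified, 1))
```

With these figures the base-weighted index is 135 and the given-weighted index is 132. The “ideal” index falls between them, and the modified form gives nearly the same result with less computation.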

The aggregative procedure lends itself readily to the con- 
struction of index numbers for commodity groups. This 
is desirable in the study of production movements, as it 
is for prices. The significant features of production changes 
over a given period may be far more clearly revealed by 
measurements of relative changes in the output of different 
classes of goods than by a general index of production. 

¹ Since the price or value factor enters in the derivation of such an index, 
whether it be constructed from relative numbers or from value aggregates, 
no quantity index is completely divorced from pecuniary measurements. For 
a discussion of this point, and of other logical problems involved in the con- 
struction of index numbers of production, see Arthur F. Burns, “The Measure- 
ment of the Physical Volume of Production,” Quarterly Journal of Economics, 
February, 1930. 

Changes in the volume of production of various classes 
of manufactured goods during the period 1914-1935 are 
indicated by the following measurements, constructed by 
the National Bureau of Economic Research. The basic 
data, which were compiled by the Census of Manufactures, 
provide the quantity and (by derivation) the unit price 
records required for the “ideal” formula. That formula, 
slightly modified for working purposes, was employed in 
the construction of these index numbers. 

Table 81 

Index Numbers of the Physical Volume of Production of 
Manufactured Goods in the United States, 1914-1935 ¹ 

Year   All          Durable   Semi-     Perish-   Goods destined   Goods destined 
       industries   goods     durable   able      for human        for capital 
                              goods     goods     consumption      equipment and 
                                                                   construction 

1914     100.0       100.0     100.0     100.0        100.0            100.0 
1919     129.5       141.7     120.9     123.2        129.1            129.5 
1921     104.5        99.6     104.6     108.9        109.4             91.8 
1923     155.8       183.7     140.2     135.4        150.4            164.3 
1925     159.5       185.2     141.8     144.4        154.0            167.7 
1927     163.3       177.2     151.0     164.9        159.5            166.7 
1929     183.7       210.9     162.5     170.9        177.7            192.0 
1931     138.2       112.3     137.4     154.9        146.9            103.7 
1933     128.0        91.4     140.1     144.4        142.6 
1935     160.5       143.9     164.4     163.9        171.3            122.5 

Selected measurements from Table 81 are shown graphi- 
cally in Fig. 65. 

¹ Constructed by the National Bureau of Economic Research, New York. 
See Economic Tendencies in the United States for a statement on procedure. 
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Adjusted Index Numbers of Production 

In the analysis of time series we have seen that cyclical 
fluctuations are often the objects of primary interest. This 
is particularly true in the study of physical volume, for 
changes in the volume of production and trade are features 
of fundamental importance in business cycles. Methods 
have been explained, in the preceding chapters, by means 
of which we seek to measure the cyclical fluctuations in 
individual series (fluctuations which are inextricably entan- 
gled with accidental movements of major and minor degree). 
An obvious next step, in the study of general business 
conditions, is the combination of the cyclical-accidental 
movements in a number of series into a single index. The 
utility of such an index of changes in the physical volume 
of production in the course of the business cycle is evident. 

When annual data are employed the construction of an 
index of these cyclical changes is simple. No problem of 
seasonal variation enters, and secular trend alone has to 
be taken account of. Two different methods by which 
this may be done present themselves. Edmund E. Day, 
a pioneer in this field of economic research, has tested both 
methods. 

The first involves the fitting of an appropriate line of 
trend to each of the constituent series. The actual items 
are then expressed as percentages of the corresponding 
trend values. When this has been done for each series, 
the final adjusted index for a given year is obtained by 
taking a weighted average of these percentages for that 
year. Each commodity may be weighted in this averaging 
process, as in the calculation of the unadjusted index. 
The resulting adjusted index is in terms of relatives, but 
these relatives refer to a hypothetical “normal,” instead 
of to any fixed base. This is the desired index of cyclical- 
accidental changes in the physical volume of production. 
With monthly data the process is the same, except that, 
before being averaged, the deviations from trend are cor- 
rected to eliminate the influence of recurrent seasonal 
movements. 
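
The first method may be sketched as follows. Each series is expressed as a percentage of its own “normal” (trend, corrected for seasonal variation where monthly data are used), and the percentages are then averaged with weights. All figures here are hypothetical, including the weights, and the function names are ours.

```python
def percent_of_normal(actual, trend, seasonal_index=100.0):
    """Actual value as a percentage of trend corrected for seasonal variation.

    With annual data the seasonal index is left at 100.
    """
    normal = trend * seasonal_index / 100.0
    return 100.0 * actual / normal


def adjusted_index(percentages, weights):
    """Weighted average of the percentages of normal for one date."""
    total_weight = sum(weights[name] for name in percentages)
    weighted_sum = sum(percentages[name] * weights[name] for name in percentages)
    return weighted_sum / total_weight


# Hypothetical annual figures for two constituent series.
percentages = {
    "pig iron": percent_of_normal(actual=110.0, trend=100.0),
    "cattle receipts": percent_of_normal(actual=95.0, trend=100.0),
}
weights = {"pig iron": 20, "cattle receipts": 4}  # hypothetical weights

index = adjusted_index(percentages, weights)
print(index)

# A monthly item: the trend value is first multiplied by the seasonal index.
monthly = percent_of_normal(actual=132.0, trend=100.0, seasonal_index=110.0)
```

The result is read against a “normal” of 100. A value of 107.5 indicates production 7.5 per cent above the level that trend alone would lead us to expect.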

In the process of averaging deviations from trend, account 
should be taken of the relative variability of the series 
being combined. As an example, we may consider the 
combination of data of pig iron production and cattle 
receipts in a general index of production. Reducing pig 
iron production to terms of “seasonally adjusted deviations 
from trend,” we obtain a series marked by rather extreme 
fluctuations. The standard deviation of this adjusted 
series, for a given period, was 27. For cattle receipts, cor- 
respondingly adjusted, the standard deviation was 11. In 
any combination of the two series of percentage deviations 
the more widely fluctuating pig iron measurements will exer- 
cise a dominant influence, unless correction is made. The use 
of weights defining the relative economic importance of the 
two series will not prevent distortion due to the greater 
variability of the pig iron series. 

One way out of the difficulty is to divide the deviations 
from trend by the respective standard deviations, before 
averaging. This gives an index in standard deviation units. 
Another procedure involves the combination of the “eco- 
nomic weight” and the standard deviation of each series 
in a weighting factor to be applied directly to the percentage 
deviations from trend. The economic weight is divided 
by the corresponding standard deviation, in making the 
combination. The method is illustrated below. 
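
The division just described may be sketched in a few lines. The economic weights (20 and 4) and standard deviations (27 and 11) are those of the pig iron and cattle series in the illustration; the function name and the layout of the data are ours.

```python
def weighting_factor(economic_weight, standard_deviation):
    """Economic weight divided by the standard deviation of the series."""
    return economic_weight / standard_deviation


series = {
    "Pig iron production": {"economic_weight": 20, "standard_deviation": 27},
    "Cattle receipts": {"economic_weight": 4, "standard_deviation": 11},
}

factors = {name: weighting_factor(**row) for name, row in series.items()}
for name, factor in factors.items():
    print(name, round(factor, 2))

# A swing of one standard deviation, multiplied by the factor, contributes
# exactly the economic weight, whatever the variability of the series.
pig_iron_contribution = factors["Pig iron production"] * 27
cattle_contribution = factors["Cattle receipts"] * 11
```

Applied to the percentage deviations, the factor allows each series to count in proportion to its economic importance rather than to the violence of its fluctuations.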

Series                  Economic   Standard    Economic weight 
                        weight     deviation   ÷ standard deviation 

Pig iron production       20          27              0.74 
Cattle receipts            4          11              0.36 

The final weighting factors are the figures given in the last 
column. These may, of course, be rounded off when a 
number of series are to be averaged.¹ 

¹ This useful method of combining economic weights and standard devia- 
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The alternative method of combining economic series is 
simpler. Each unadjusted index possesses a trend which is 
“a composite of the persistent tendencies of the several 
original series upon which the unadjusted index is based.” 
It is possible to measure this trend, instead of the separate 
original trends, and secure the adjusted index directly from 
the unadjusted. Day’s results indicate that there is no 
loss of accuracy in the use of the simpler method. 
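
The simpler method may be sketched as follows: a trend is fitted to the unadjusted index itself, and each value is then expressed as a percentage of its trend value. The straight-line form of the trend and the least-squares fit are assumptions made for illustration, and the index values are hypothetical.

```python
def fit_linear_trend(values):
    """Least-squares straight line through equally spaced observations.

    Returns (intercept, slope), with time counted 0, 1, 2, ...
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def percent_of_trend(values):
    """Each value of the unadjusted index as a percentage of its trend value."""
    intercept, slope = fit_linear_trend(values)
    return [100.0 * y / (intercept + slope * x) for x, y in enumerate(values)]


# Hypothetical unadjusted annual index of production.
unadjusted = [100.0, 108.0, 104.0, 115.0, 113.0]
adjusted = percent_of_trend(unadjusted)
print([round(a, 1) for a in adjusted])

# A series lying exactly on a straight line shows no deviation from trend.
on_trend = percent_of_trend([100.0, 110.0, 120.0, 130.0])
```

One trend is fitted in place of many, which accounts for the saving of labor.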

AN INDEX OF INDUSTRIAL ACTIVITY 

This procedure, with certain modifications, is well exempli- 
fied in an “Index of Industrial Activity in the United 
States,” constructed by the Chief Statistician’s Division 
of the American Telephone and Telegraph Company.¹ The 
elements of this index are monthly data; seasonal corrections 
are therefore necessary. When these corrections have been 
made a general index measuring long-term growth and 
cyclical-accidental fluctuations, in combination, is con- 
structed by averaging 11 series, with appropriate weights.² 
This index is shown for the period 1899-1937, with line 
of trend, in Fig. 66. The trend line was fitted by least 
squares to data for the period 1899-1930, with the war 
years, 1917-1918, omitted. 

When each monthly value of the index is expressed as 
a percentage deviation from the corresponding trend value, 
the measurements presented in Table 82, and graphically 
portrayed in Fig. 67, are obtained. The cyclical-accidental 

tions has been employed by G. W. Starr, Director of the Bureau of Business 
Research of Indiana University. I am indebted to him for the example. 

¹ This index has been constructed for the use of the staffs of the Bell system 
companies, and is not available for distribution. It is published here by cour- 
tesy of the American Telephone and Telegraph Company. 

² The following series were used for the later years of the period covered: 
steel ingot production, pig iron production, automobile passenger car produc- 
tion, building contracts awarded (on a square foot basis), cotton consumption, 
wool consumption, slaughter of cattle and hogs, newsprint consumption, mis- 
cellaneous freight car loadings, electric power consumption, and employment 
in manufacturing industries. Since employment is included, the index goes 
slightly beyond the field of strict physical production. It is intended to be 
an index of industrial activity. 
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Weighting factors. For mineral products the weight for 
each commodity is its average per unit value in the base 
period. For manufactured products the weight for a given 
commodity is the per unit “value added” (i.e., added by 
manufacture), modified to the extent that the commodity 
in question is taken to represent other manufactured prod- 
ucts not directly included in the index. These “weights” 
thus correspond to p’s in the aggregative formula 

\[ \frac{\sum(q_1 p_0)}{\sum(q_0 p_0)}, \]

except that for a manufactured product the p is a “price” 
for the services of agents of fabrication, with a modification 
to allow the given commodity to represent similar products 
for which quantity data are not available. The weights 
for manufactured goods were drawn from the Census of 
Manufactures for 1923. The \(p_0\) used to weight the q for 
manufactures is thus not strictly a base-period price.¹ 

Fig. 68. Physical Volume of Industrial Production in the United States, 
1919-1937 (1923-1925 average = 100) 

¹ Weighting factors were modified for the period 1919-1922 by the com- 
bination of weights for 1919 with those for the base period. 

Table 83 

Index of Industrial Production, Board of Governors of the Federal 
Reserve System, 1919-1937 
(Adjusted for seasonal variation. 1923-1925 average = 100) 

Month          1919  1920  1921  1922  1923  1924  1925  1926  1927  1928 

Jan.             82    95    67    73    99   100   105   106   107   107 
Feb.             79    95    66    76   100   102   104   105   108   109 
March            76    93    64    80   103   100   103   106   110   108 
April            78    88    64    77   106    95   102   107   108   108 
May              78    90    66    81   106    89   102   106   109   108 
June             83    91    65    85   106    85   102   108   107   108 
July             87    89    65    85   104    84   103   108   106   109 
Aug.             89    89    67    83   103    89   103   110   106   110 
Sept.            87    86    68    88   100    94   101   111   104   113 
Oct.             86    83    71    93    99    95   104   111   102   115 
Nov.             85    76    71    97    98    97   107   110   101   117 
Dec.             86    72    70   100    97   101   109   107   102   118 

Annual index     83    87    67    85   101    95   104   108   106   111 

Month          1929  1930  1931  1932  1933  1934  1935  1936  1937 

Jan.            119   106    83    72    65    78    90    97   114 
Feb.            118   107    86    69    63    81    89    94   116 
March           118   103    87    67    59    84    88    93   118 
April           121   104    88    63    66    85    86   101   118 
May             122   102    87    60    78    86    85   101   118 
June            125    98    83    59    91    83    87   104   114 
July            124    93    82    58   100    76    86   108   114 
Aug.            121    90    78    60    91    73    88   108   117 
Sept.           121    90    76    66    84    71    91   109   111 
Oct.            118    88    73    67    76    73    95   110   102 
Nov.            110    86    73    65    72    74    96   114    88 
Dec.            103    84    74    66    75    86   101   121    84 

Annual index    119    96    81    64    76    79    90   105   110 

Adjustment for seasonal variation. No correction for trend 
is made, but in one form of the index an adjustment is 
made to eliminate the effect of seasonal fluctuations in the 
output of individual commodities. Seasonal indices were 
computed by averaging the ratios of actual data to twelve- 
month moving averages. (See Chapter VIII.) Where there 
was evidence of progressive change in the seasonal pattern, 
the seasonal adjustments for a given commodity were 
modified from year to year. The actual adjustment for 
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seasonal change is made by dividing the daily average 
output of a given commodity in a stated month by the 
seasonal index for that month, expressed as a ratio (i.e., 
as 1.10, if the conventional index were 110). The seasonally 
adjusted q would thus be reduced if the seasonal index 
were above 1.00, raised if the seasonal index were below 
1.00. In the construction of the seasonally corrected 
index, these adjusted q’s are used in the aggregative formula 
previously described.¹ 
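
The seasonal adjustment described above amounts to a single division. The following is a minimal sketch; the function name and the output figures are hypothetical.

```python
def seasonally_adjust(daily_average_output, seasonal_index):
    """Divide output by the seasonal index expressed as a ratio.

    A conventional index of 110 is used as 1.10, so output in a
    seasonally high month is reduced and output in a low month is raised.
    """
    ratio = seasonal_index / 100.0
    return daily_average_output / ratio


# Hypothetical daily average outputs in a seasonally high and a low month.
high_month = seasonally_adjust(220.0, seasonal_index=110.0)
low_month = seasonally_adjust(180.0, seasonal_index=90.0)
print(round(high_month, 1), round(low_month, 1))
```

Both months reduce to the same adjusted quantity of 200, which signifies that the whole of the difference between them was seasonal.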

Monthly values of this index are given in Table 83, for 
the period 1919-1937. The index is shown graphically in 
Fig. 68 on page 317. 

Derived Indices of Production and Productivity 

It is possible, where suitable records of value of product 
and indices of price changes are available, to derive an 
index of production by indirection. In the case of a single 
commodity it is obvious that \(pq \div p = q\). (Here q repre- 
sents the number of physical units produced, p represents 
average per unit price, and pq is the aggregate value.) 
A similar process is possible in handling statistics relating 
to a number of commodities, in combination. Indeed, the 
records may be in the form of relatives, or index numbers, 
covering a number of months or years. Division of a value 
index by a price index relating to the commodities included 
in the value index will yield an index measuring changes 
in physical output. 

This procedure may sometimes be used to obtain meas- 
urements that could not possibly be built up by combining 
a number of individual records. Whether the method is 
applicable in a given instance depends upon the compara- 
bility of the price and value index numbers. The strict 
requirement that the price index relate to precisely the 
commodities included in the value index cannot generally 
be met. If we assume that a given price index is fairly 
representative of the commodities covered by the value 
records, and if the formula employed in the construction 
of the index is appropriate, the method may be justified 
as a means of approximating the required index of physical 
output. 

¹ A detailed description of the constituents of this index and of the pro- 
cedure employed in its construction is given in the Federal Reserve Bulletin 
for February, 1927. Revisions are noted in the issues of that Bulletin for 
March, 1932, Sept., 1933, Nov., 1936, and March, 1937. The index appears 
in current issues of the Federal Reserve Bulletin. 

An example of such a procedure is furnished by the 
materials in Table 84. These illustrate a method used in 
deriving an index of production of manufactured goods. 
The indices in col. (3) are derived directly from the aggre- 
gate figures on “value added by manufacture.” The indices 
in col. (4) measure changes in average “value added” per 
unit, or cost of fabrication per unit, of manufactured goods. 
(This is, in effect, a price index, the price covering the 
services of manufacturing agents in the process of fabrica- 
tion.) This series of index numbers is based upon records 
available for a representative sample of manufacturing 
industries. The general index of manufacturing production, 
relating to all industries, is derived by dividing the relatives 
for total “value added” by the index numbers measuring 
changes in “value added” per unit of product (with a 
suitable shift in the decimal point). 

Table 84 

Illustrating the Derivation of Index Numbers of the Physical Volume 
of Manufacturing Production, 1923-1929 ¹ 

(1)     (2)               (3)              (4)                    (5) 
Year    Total value added,                 Index of value added   Derived index of 
        all census industries              per unit of product,   physical volume 
        (in millions      (in relatives)   industries included    of production 
        of dollars)                        in sample              (3) ÷ (4) 

1923      25,850            100.0            100.0                  100.0 
1925      26,778            103.6             97.3                  106.4 
1927      27,585            106.7             92.4                  115.4 
1929      31,844            123.2             96.8                  127.3 

¹ This table is taken from Economic Tendencies in the United States, New 
York, National Bureau of Economic Research, 308. 
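
The division may be followed through with the figures of Table 84. This is a sketch only: the variable and function names are ours, and because the published index of value added per unit is rounded to one decimal place, the computed results may differ from col. (5) by about one tenth of a point.

```python
# Figures from Table 84: col. (2) in millions of dollars, col. (4) and col. (5).
value_added = {1923: 25850, 1925: 26778, 1927: 27585, 1929: 31844}
unit_value_index = {1923: 100.0, 1925: 97.3, 1927: 92.4, 1929: 96.8}
published_volume_index = {1923: 100.0, 1925: 106.4, 1927: 115.4, 1929: 127.3}


def relatives(series, base_year):
    """Each item as a percentage of the base-year item, as in col. (3)."""
    return {year: 100.0 * value / series[base_year] for year, value in series.items()}


def derived_volume_index(value_relative, price_index):
    """Value index divided by price index, with the decimal point shifted."""
    return 100.0 * value_relative / price_index


value_relatives = relatives(value_added, base_year=1923)
derived = {
    year: derived_volume_index(value_relatives[year], unit_value_index[year])
    for year in value_added
}
for year in sorted(derived):
    print(year, round(value_relatives[year], 1), round(derived[year], 1))
```

The relatives in col. (3) are reproduced exactly, and the derived index agrees with col. (5) within the limits set by rounding.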

The derived measurements given in col. (5) of Table 84 
are probably more accurate than index numbers based upon 
directly enumerated physical products. For the gaps in 
the coverage of the latter are serious. Limitations of coverage 
are the more serious in that the excluded industries are 
in many cases just the new, rapidly developing industries 
the output of which is growing most rapidly. 

A somewhat similar process of derivation is employed in 
the construction of measurements of industrial productivity. 
It is impossible, by direct observation, to compile records 
of output per man or per man-hour over any considerable 
area of industrial activity. However, given accurate indices 
of physical production and comparable records of number 
of workers employed or of man-hours worked, one may 
derive index numbers measuring changes in productivity. 

An example of this procedure is given in Table 85. The 
measurements given should be regarded as approximations 
only. 

Table 85 

Index Numbers of Physical Volume of Production, Man-Hours 
Worked and Output per Man-Hour, Manufacturing 
Industries of the United States, 1929-1935 

Year    Physical volume     Total number     Estimated 
        of manufacturing    of man-hours     output per 
        production          worked           man-hour 

1929         100                100              100 
1931          75                 66              114 
1933          70                 60              117 
1935          87                 70              124 


Between 1929 and 1935 the total volume of manufacturing 
production declined 13 per cent. The number of man- 
hours worked decreased by 30 per cent, however. The 
indicated gain in output per man-hour was 24 per cent. 
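
The last column of Table 85 may be reproduced by dividing the production index by the index of man-hours. This is a sketch using the figures of the table; the variable names are ours.

```python
# Index numbers from Table 85 (1929 = 100).
production = {1929: 100, 1931: 75, 1933: 70, 1935: 87}
man_hours = {1929: 100, 1931: 66, 1933: 60, 1935: 70}

# Output per man-hour: production index divided by man-hours index.
output_per_man_hour = {
    year: round(100 * production[year] / man_hours[year]) for year in production
}
print(output_per_man_hour)

# Percentage changes between 1929 and 1935.
production_change = production[1935] - production[1929]
man_hours_change = man_hours[1935] - man_hours[1929]
productivity_change = output_per_man_hour[1935] - output_per_man_hour[1929]
```

The computed figures agree with the table: a decline of 13 per cent in production, accompanied by a decline of 30 per cent in man-hours, yields an indicated gain of 24 per cent in output per man-hour.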

Measurements such as these are of unquestioned value 
to the student of industrial change, but their limitations 
should be clearly stated. The initial necessity of full com- 
parability between the output and employment records has 
been mentioned. Discrepancies here may lead to serious 
errors in the derived measurements. More difficult to 
detect are technical industrial changes that do not appear 
in the statistical records. Changes in the quality of the 
goods represented in the production index may lessen the 
accuracy of that index, and affect the productivity measure- 
ments. If employment is measured in terms of number of 
men employed, the resulting index of per capita output 
may be seriously distorted by changes in the length of the 
working week. Again, if only direct labor is enumerated 
in the employment index, a shift in technical methods that
involves the use of a much larger proportion of indirect 
labor may lead to a great advance in apparent productivity, 
which far exceeds the real gain. Some of the gain that 
apparently follows the increased mechanization of a plant 
or a process is of this fictitious sort. Labor that precedes 
the direct act of production, and servicing and supervising 
labor, may have replaced direct labor. Failure to take 
account of the contributions of these indirect applications
of labor may lead to grossly exaggerated measures of
productivity gains. 

The purpose of the preceding pages has been to exemplify
procedures used in the measurement of changes in produc- 
tion, with incidental reference to related problems. While 
there is no one standard method, it will be clear that the 
construction of quantity index numbers requires no involved 
procedure. Certain special problems — of weighting, of 
measuring secular and seasonal movements, of ensuring 
comparability when methods of derivation are employed — 
have been noted. In addition, most of the problems that 
bulk large in the construction of price index numbers are
faced in this area also. The task of obtaining accurate, 
homogeneous series of basic data entails no less careful 
field work in production than in prices. Quality changes 
lessen the accuracy of both types of index numbers. Com- 
parisons over considerable time periods are rendered inaccu- 
rate by such quality changes and perhaps even more by 
changes in “regimen” — in the complex of tastes, consuming 
habits, and technical methods that determines the weighting 
factors used in the construction of index numbers. In spite 
of these difficulties substantial progress has been made in 
recent years in the improvement of measures of industrial
activity of the type discussed in this chapter. More com- 
prehensive and more accurate data are being compiled, 
and technical standards in the construction of index numbers 
are being steadily raised. These gains are contributing to 
a notable advance in our knowledge of economic processes. 


REFERENCES 

Bliss, Charles A., “Production in Depression and Recovery,” 
Bulletin 58, National Bureau of Economic Research, Nov. 15, 1935. 

Burns, Arthur F., “The Measurement of the Physical Volume of
Production,” Quarterly Journal of Economics, Feb., 1930.

Burns, Arthur F., Production Trends in the United States Since 
1870. 

Day, E. E., “An Index of the Physical Volume of Production.” 
Review of Economic Statistics, Sept.

Day, E. E. and Thomas, W., The Growth of Manufactures, 
1899-1923. 

Leong, Y. S., “Indexes of the Physical Volume of Production of
Producers’ Goods, Consumers’ Goods, Durable Goods and Transient
Goods.” Journal of the American Statistical Association, June,
1935.

Mills, F. C., Economic Tendencies in the United States, Appen- 
dices 1, 4. 

Mitchell, W. C., Business Cycles, The Problem and Its Setting, 
Chap. 3. 

Perry, F. G. and Silverman, A. G., “A New Index of the Physical
Volume of Canadian Business,” Journal of the American Statistical
Association, June, 1929.

Persons, W. M., Forecasting Business Cycles.

Snyder, Carl, Business Cycles and Business Measurements,
Chaps. 2-5.

Weintraub, David, “Unemployment and Increasing Productivity,”
in Technological Trends and National Policy, National
Resources Committee, June, 1937 (75th Congress, 1st Session,
House Document No. 360).



CHAPTER X 


THE MEASUREMENT OF RELATIONSHIP: LINEAR 
CORRELATION 

In discussing averages and measures of dispersion and 
skewness we have been dealing with methods of describing 
a single frequency distribution. The arrangement of the 
values of a single variable along a scale may be portrayed 
by means of these measures, which enable the central value 
to be determined and the character of the distribution 
about that central value to be described. In the analysis 
of time series a somewhat different problem has been faced. 
In such cases we are concerned with the changing values 
of a variable factor with the passage of time, and seek to 
determine the degree to which the changes in value are 
due to the play of different forces — the secular trend and 
cyclical, seasonal, and accidental factors. The preceding 
chapters dealt with methods by which we might measure 
the effect upon a given series of each of these factors (with 
the exception of accidental fluctuations). 

Certain of these methods are applicable to the problem 
now before us. It was found that in dealing with time 
series the relationship between time and the long term trend 
factor may be described by a definite mathematical equa- 
tion. That is, trend or growth seems to be a function of 
time for many economic series. Where such a relationship 
prevails, whether it hold precisely or only approximately, 
there is a distinct advantage in securing a mathematical 
expression which describes it. A similar but much broader 
problem is now to be discussed. If it is possible in dealing 
with time series to secure a definite mathematical equation 
for the relation between time and the normal values of the 
items in a given series, cannot the same device be employed
in studying the relationship between other variables? Can 
we not define, mathematically, the relation between cotton 
production and the price of cotton, between corn yield 
and rainfall, between earnings and the output of labor? 
If this can be done, it will place in the hands of the econo- 
mist a very powerful tool, giving his methods something 
of the precision which attaches to the work of the physical 
scientist. 

The Relation between Number of Taxable Personal
Incomes and Motor Vehicle Registration

As a typical problem we may consider the relation between 
the number of taxable personal incomes and the number 
of passenger automobiles registered, by states in 1934. 
The figures are given in columns (2) and (3) of Table 86.* 

These figures are plotted in Fig. 69, each dot representing 
the relation between the number of taxable incomes and 
the number of registered passenger cars for a given state. 
Such a figure is termed a “scatter diagram.” It is clear
from this diagram that there is a relationship between the
two variables. In general, the states with a large number
of taxable personal incomes are also those having a large 
number of motor vehicle registrations. The relationship, 
however, is not perfect. Two states with the same number 
of taxable incomes may differ quite widely in the number 
of registered vehicles. Thus both Rhode Island and Colorado 

* Nine states for each of which there were more than 100,000 individual
income tax returns and more than 685,000 passenger cars registered in 1934
have not been included. The observations for these states, some of which are
materially affected by the presence of important industrial centers, depart
rather widely from those for the remaining states, and are marked by a functional
relationship between personal incomes and motor vehicle ownership
somewhat different from that prevailing for the country at large. The states
thus excluded are New York, Pennsylvania, New Jersey, Illinois, Massachusetts,
Michigan, Texas, Ohio, and California. The state of Washington has
also been excluded, since the income tax returns for that state are combined
with those of Alaska in the reports of the Bureau of Internal Revenue. The
results are to be interpreted, of course, with these restrictions in mind.



Table 86

Taxable Personal Incomes and Passenger Automobile Registration in
Thirty-Eight States, 1934

Column (2), X: No. of taxable personal incomes, 1934 (thousands)
Column (3), Y: No. of passenger cars registered, 1934 (thousands)

(1)                 (2)      (3)        (4)       (5)          (6)
State                 X        Y         XY       X^2          Y^2

Alabama              23      192      4,416       529       36,864
Arizona              11       80        880       121        6,400
Arkansas             13      162      2,106       169       26,244
Colorado             31      246      7,626       961       60,516
Connecticut          91      310     28,210     8,281       96,100
Delaware             11       45        495       121        2,025
Florida              33      280      9,240     1,089       78,400
Georgia              38      317     12,046     1,444      100,489
Idaho                 9       91        819        81        8,281
Indiana              70      680     47,600     4,900      462,400
Iowa                 48      591     28,368     2,304      349,281
Kansas               36      453     16,308     1,296      205,209
Kentucky             35      295     10,325     1,225       87,025
Louisiana            37      199      7,363     1,369       39,601
Maine                21      141      2,961       441       19,881
Maryland             84      288     24,192     7,056       82,944
Minnesota            67      594     39,798     4,489      352,836
Mississippi          13      141      1,833       169       19,881
Missouri             98      632     61,936     9,604      399,424
Montana              17       97      1,649       289        9,409
Nebraska             27      350      9,450       729      122,500
Nevada                5       26        130        25          676
New Hampshire        17       91      1,547       289        8,281
New Mexico            8       67        536        64        4,489
N. Carolina          32      385     12,320     1,024      148,225
N. Dakota            10      130      1,300       100       16,900
Oklahoma             39      403     15,717     1,521      162,409
Oregon               27      233      6,291       729       54,289
Rhode Island         31      124      3,844       961       15,376
S. Carolina          15      182      2,730       225       33,124
S. Dakota             8      146      1,168        64       21,316
Tennessee            38      299     11,362     1,444       89,401
Utah                 11       85        935       121        7,225
Vermont              10       69        690       100        4,761
Virginia             48      317     15,216     2,304      100,489
W. Virginia          30      167      5,010       900       27,889
Wisconsin            93      589     54,777     8,649      346,921
Wyoming               7       52        364        49        2,704

Totals            1,242    9,549    451,558    65,236    3,610,185
had 31,000 taxable personal incomes in 1934, yet the former
had 124,000 passenger cars registered, while the latter had
246,000. Were the relationship perfect a single and unchanging
value of the Y-variable would always be found paired
with a given value of the X-variable.

Fig. 69. Scatter Diagram Showing the Relation between Taxable
Personal Incomes and Passenger Car Registration, by States, in 1934,
with Line of Average Relationship (horizontal axis: Taxable Personal
Incomes in 1934, thousands)

Our first problem is the derivation of an equation to
describe this relationship which, while not perfect, is clearly
existent. There is here a relationship analogous to a trend,
and it is apparently a trend which can be represented by
a straight line. The equation to a straight line, fitted by
the method of least squares to the points on the scatter
diagram, will express mathematically the average relationship
between these two variables. Such a line could, of course,
be fitted by inspection, but a more accurate result will be
obtained by the method of least squares.

This calls for the solution of the following normal equations:

$$\Sigma(Y) = Na + b\,\Sigma(X)$$
$$\Sigma(XY) = a\,\Sigma(X) + b\,\Sigma(X^2).$$

The values required for the solution of these equations may
be derived from the data as arranged in Table 86. Substituting,
we have

$$9{,}549 = 38a + 1{,}242b$$
$$451{,}558 = 1{,}242a + 65{,}236b.$$

Solving

$$a = 66.321$$
$$b = 5.659.$$

The required equation is*

$$Y = 66.321 + 5.659X.$$

This line is plotted in Fig. 69.
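
The solution of the two normal equations can be checked with a short sketch. It assumes only the totals of Table 86; the function name `fit_line` is hypothetical, and the closed-form expressions are the usual algebraic solution of the pair of equations rather than any procedure prescribed in the text.

```python
def fit_line(n, sum_x, sum_y, sum_xy, sum_x2):
    """Solve the normal equations
        sum_y  = n*a     + b*sum_x
        sum_xy = a*sum_x + b*sum_x2
    for the constants a and b of the line Y = a + bX."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Totals from Table 86 (38 states; X and Y in thousands).
a, b = fit_line(n=38, sum_x=1242, sum_y=9549, sum_xy=451558, sum_x2=65236)
print(round(a, 3), round(b, 3))
```

Rounded to three decimal places, the constants agree with those given in the text.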

A mathematical expression has now been secured for 
the relation between the two variables being studied, the 
number of taxable personal incomes, by states, and the 
number of passenger automobiles registered. The former 
is the independent or X-variable in the equation, the latter 
the dependent or F-variable. This equation constitutes a 
measure of the functional relationship between these two 
variables, but it is only an expression of average relationship. 
How significant is the equation? If the relationship were
perfect, and the plotted points all lay on the line describing 
this relationship, the equation could be used with confidence 
as an accurate instrument for determining the value of one 
variable from a value of the other. But a line with a definite 
equation may be fitted to points which depart very widely 
from it, which are widely dispersed. In such a case the
equation may have the appearance of describing a precise 
relationship but the variation is so great that it cannot be 
used with confidence. It is the same problem as that which 
arises when an average is employed. We must know how 
significant the average is, how great the concentration about 
it, before we may use it intelligently. So the equation of
relationship between variables means little unless we know
to what extent it holds in practical experience. We must
have a measure of the dispersion about the line we have
fitted.

* In the chapters on correlation capital letters (X, Y, etc.) are used to
represent original values of the variable quantities, as measured from the zero
points on the scales of actual values. Capital letters with prime marks are
used to measure deviations from arbitrary origins, X' and Y' for such deviations
in class-interval units, X'' and Y'' for such deviations in original units
of measurement. Small letters (x, y, etc.) are used to represent values of the
variables expressed as deviations from their respective arithmetic means.

In describing the frequency distribution, the standard 
deviation is used as the best general measure of variation. 
It is, obviously, the measure we need in determining the 
reliability of the equation of average relationship. The 
standard deviation about this line will not only serve as a 
general index of the significance of this equation but will 
enable us to measure the degree of accuracy of estimates 
based upon the equation. 

THE COMPUTATION OF THE STANDARD ERROR
OF ESTIMATE

The standard deviation about a line of average relationship,
being a measure of the accuracy of estimates, may
be termed the standard error of estimate. The term standard
deviation is generally confined to the root-mean-square
deviation about the arithmetic mean. The standard error
of estimate is represented by the symbol S.

In the computation of S we must know the computed value
of Y which corresponds to each given value of X. By
substituting the given values of X in the equation

$$Y = 66.321 + 5.659X$$

normal Y values may be computed. The deviations of the
actual Y values from the computed may then be determined.
The root-mean-square of these deviations is the required
measure. A method of computation is illustrated in Table 87.
From this table we have

$$S_y = 105.3 \text{ (thousand) motor cars.}$$

(The symbol $S_y$ is used, as this is the standard error of the
Y-variable.)
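
The root-mean-square computation described above may be sketched as follows. The pairs are the X and Y columns of Table 86; the function names are hypothetical, and the line is refitted from the data at full precision, so individual residuals may differ in the last decimal place from figures worked with rounded constants.

```python
import math

# (X, Y) pairs from Table 86, in thousands:
# X = taxable personal incomes, Y = passenger cars registered, 1934.
DATA = [
    (23, 192), (11, 80), (13, 162), (31, 246), (91, 310), (11, 45),
    (33, 280), (38, 317), (9, 91), (70, 680), (48, 591), (36, 453),
    (35, 295), (37, 199), (21, 141), (84, 288), (67, 594), (13, 141),
    (98, 632), (17, 97), (27, 350), (5, 26), (17, 91), (8, 67),
    (32, 385), (10, 130), (39, 403), (27, 233), (31, 124), (15, 182),
    (8, 146), (38, 299), (11, 85), (10, 69), (48, 317), (30, 167),
    (93, 589), (7, 52),
]


def least_squares(points):
    """Constants a and b of the least-squares line Y = a + bX."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


def residuals(points):
    """Actual Y minus computed Y for every observation."""
    a, b = least_squares(points)
    return [y - (a + b * x) for x, y in points]


def standard_error_of_estimate(points):
    """Root-mean-square deviation of the actual Y values from the line."""
    d = residuals(points)
    return math.sqrt(sum(value ** 2 for value in d) / len(points))


s_y = standard_error_of_estimate(DATA)
print(round(s_y, 1))
```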



Table 87

Computation of Standard Error of Estimate

Column (2): No. of passenger cars registered, 1934 (in thousands), Y-actual
Column (3): Y-computed
Column (4): d = (2) - (3)
Column (5): d^2

(1)                 (2)       (3)        (4)           (5)
State            Y-actual  Y-computed     d            d^2

Alabama             192      196.5      -  4.5         20.25
Arizona              80      128.6      - 48.6      2,361.96
Arkansas            162      139.9      + 22.1        488.41
Colorado            246      241.8      +  4.2         17.64
Connecticut         310      581.3      -271.3     73,603.69
Delaware             45      128.6      - 83.6      6,988.96
Florida             280      253.1      + 26.9        723.61
Georgia             317      281.4      + 35.6      1,267.36
Idaho                91      117.3      - 26.3        691.69
Indiana             680      462.4      +217.6     47,349.76
Iowa                591      337.9      +253.1     64,059.61
Kansas              453      270.0      +183.0     33,489.00
Kentucky            295      264.4      + 30.6        936.36
Louisiana           199      275.7      - 76.3      5,821.69
Maine               141      185.2      - 44.2      1,953.64
Maryland            288      541.7      -253.7     64,262.25
Minnesota           594      445.5      +148.5     22,052.25
Mississippi         141      139.9      +  1.1          1.21
Missouri            632      620.9      + 11.1        123.21
Montana              97      162.5      - 65.5      4,290.25
Nebraska            350      219.1      +130.9     17,134.81
Nevada               26       94.6      - 68.6      4,705.96
New Hampshire        91      162.5      - 71.5      5,112.25
New Mexico           67      111.6      - 44.6      1,989.16
N. Carolina         385      247.4      +137.6     18,933.76
N. Dakota           130      122.9      +  7.1         50.41
Oklahoma            403      287.0      +116.0     13,456.00
Oregon              233      219.1      + 13.9        193.21
Rhode Island        124      241.8      -117.8     13,876.84
S. Carolina         182      151.2      + 30.8        948.64
S. Dakota           146      111.6      + 34.4      1,183.36
Tennessee           299      281.4      + 17.6        309.76
Utah                 85      128.6      - 42.6      1,814.76
Vermont              69      122.9      - 53.9      2,905.21
Virginia            317      338.0      - 21.0        441.00
W. Virginia         167      236.1      - 69.1      4,774.81
Wisconsin           589      592.6      -  3.6         12.96
Wyoming              52      105.9      - 53.9      2,905.21

Total                                              421,250.91
The measure is to be interpreted in precisely the same
way as the standard deviation about an arithmetic mean.
Given an approximately normal distribution of items about
the line of relationship, 68 per cent of all the cases will lie
within a range of ± S (in this case 105.3), 95 per cent
will fall within ± 2S (in this case 210.6) and 99.7 per cent will
fall within ± 3S (in this case 315.9). If there were no
scatter about the line fitted to the points representing the 
corresponding values of X and Y, S would have a value 
of zero, and the value of Y could be estimated from the 
value of X with perfect accuracy. The less the dispersion 
about the line, the smaller the value of S. The value of 
S serves, therefore, as an indicator of the significance and 
usefulness of the line which describes the relationship 
between the two variables. The standard error, it should 
be noted, is expressed in the same units as the original 
Y-values.

THE MAKING OF ESTIMATES 

We may, for a moment, consider the significance of these 
results. Let us assume that, not knowing the number of 
motor vehicles registered in a given state, we are under the 
necessity of estimating it. Two methods are open to us. 
We may, in the first place, base the estimate upon our 
knowledge of the Y-variable alone. The total number of
passenger automobiles in the 38 states included in the 
study is 9,549,000. Dividing this by 38 we have 251,289 
as the average. With no specific information as to the 
registration in a given state, the arithmetic mean of all 
the state figures would be taken as the most probable value 
for the state in question. (The most probable value of a 
series of observations is the mean of the series.) How may 
we judge of the accuracy of this estimate? The standard 
deviation of the original distribution is a measure of the 
degree of variation about the mean and, therefore, a measure 
of the accuracy of an estimate based upon the mean. If 
the distribution approximates the normal type, the chances
are 68 out of 100 that the true value for the state in question 
will not differ from the mean by more than the standard 
deviation. The standard deviation of passenger automobile 
registration by states, as recorded in Table 86, is 178.5. 
The mean affords, therefore, a basis for a reasonable estimate, 
and the standard deviation affords some indication of the 
probabilities involved in making this estimate. 

Another method of estimating the motor vehicle registra- 
tion in a given state is open to us if we know the number 
of taxable personal incomes in that state. We know, as a 
result of the study described in the preceding pages, that 
the average relationship between passenger car registration 
and number of taxable personal incomes is described by the 
equation 

Y = 66.321 + 5.659X. 

(The unit is 1,000 for each variable, it will be recalled.)

If a state has 50,000 taxable personal incomes, it may be
estimated from this equation that there are 349,271 passenger 
automobiles registered in that state. This is the most prob- 
able value of Y as determined from the equation of average 
relationship. Is this estimate any better than the previous 
one, which took the mean Y as the most probable value? 
Does our knowledge of the average relationship between 
X and Y aid us in estimating the value of Y from a known 
value of X? 

The answers to these questions are given by the standard 
error of estimate, and by the relationship between the 
standard error of estimate and the standard deviation of Y.
The standard error of estimate (that is, the standard deviation
about the line of average relationship) is 105.3. The
standard deviation of Y is 178.5. Clearly the estimate
made from the equation is more accurate than the estimate
based upon the value of the mean Y. In the former case
the odds are 68 out of 100 that the error will not exceed
105.3 or, in terms of the original units, 105,300 vehicles.
When the estimate is made from the mean, the odds are 
68 out of 100 that the error will not exceed 178,500 vehicles.* 
From our knowledge of the relationship between the two 
variables, even though that relationship is by no means 
constant or perfect, we are able to reduce materially the 
errors of estimate. 
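
The two methods of estimate compared above can be set side by side in a brief sketch. The constants are those quoted in the text; the function names and the ratio printed at the end are illustrative additions, not part of the original discussion.

```python
N = 38            # number of states in the study
SUM_Y = 9549      # total passenger cars registered (thousands), Table 86
A, B = 66.321, 5.659   # constants of the fitted line Y = A + B*X
SIGMA_Y = 178.5   # standard deviation of Y (thousands)
S_Y = 105.3       # standard error of estimate (thousands)


def estimate_from_mean():
    """Estimate for any state when nothing but the Y series is known."""
    return SUM_Y / N


def estimate_from_line(x):
    """Estimate for a state with x thousand taxable personal incomes."""
    return A + B * x


mean_estimate = estimate_from_mean()
line_estimate = estimate_from_line(50)

# Ratio of the two measures of error: the smaller it is, the more the
# equation improves upon the mean as a basis for estimates.
error_ratio = S_Y / SIGMA_Y

print(round(mean_estimate, 3), "thousand cars, error measure", SIGMA_Y)
print(round(line_estimate, 3), "thousand cars, error measure", S_Y)
print(round(error_ratio, 2))
```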

THE COEFFICIENT OF CORRELATION

We have now secured two measures which aid us in 
describing the relationship between variable quantities. 
The first is the fundamental equation of relationship, the 
expression of the degree of change in one variable associated, 
on the average, with a given change in the other. The second 
is the standard error of estimate, the measure of the degree 
of “scatter” about the line of average relationship. The 
standard error resembles the standard deviation in that 
it is a measure expressed in absolute terms, in the units 
employed in measuring the original Y-values. This measure
enables us to determine in a given case the probability that 
an estimate based upon the equation of relationship will 
fall within certain limits. 

In measuring variation it has been found that an abstract 
measure of variability is needed, one which is divorced 
from the absolute terms of the given problem. Such a 
measure is particularly needed, it was noted, when different 
distributions are to be compared. So, for measuring the 
degree of variability, a coefficient of variation is employed. 
There is need of a somewhat similar measure in connection 
with our present problem. We need a measure of the 
degree of relationship between two variables, an abstract 
coefficient which is divorced from the particular units
employed in a given case. Karl Pearson has developed such
a coefficient.

* In the present case, with a limited number of items and distributions which
depart somewhat from the normal type, the precise probabilities cannot be so
accurately determined from the values of $S_y$ and $\sigma_y$. With this qualification
in the matter of interpretation we may use $S_y$ and $\sigma_y$ as useful measures of
dispersion.

This measure may be explained in terms of the preceding 
discussion. It was found that the usefulness of estimates 
based upon the equation of relationship could be determined 
by comparing the standard error of Y (the measure of
scatter about the line of relationship) with the standard 
deviation of Y. If the standard error be as great as the 
standard deviation the equation of relationship is of no 
use to us, but if the standard error be less than the standard 
deviation the accuracy of estimates may be improved by 
using this equation. The significance of the equation is thus 
indicated by the relation between the standard error and 
the standard deviation. But these are both in absolute 
terms, so that by dividing one by the other an abstract 
measure may be secured. Thus we might write 


$$\text{Measure of correlation} = \frac{S_y}{\sigma_y}.$$

A somewhat more useful measure is secured by putting the
ratio in this form:

$$\text{Measure of correlation} = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}.$$



This measure, when used in connection with a linear equa- 
tion, is called the coefficient of correlation and is represented 
by the symbol r. 

A brief consideration of this formula will help to make 
clear the significance of r. If there is no dispersion about 
the line of relationship, $S_y$ will have a value of zero; the
equation describes a perfect relationship between the two 
variables. In this case, as is clear from the formula, r must 
have a value of 1. 

The maximum value of $S_y$ is one which is equal to $\sigma_y$.
Under these conditions, when the equation of relationship 
is of no aid in improving our estimates, the formula will 
give zero as the value of r. Such a value indicates that
there is no relationship between the two variables: in other
words, that the straight line of best fit is horizontal, passing
through the mean of the Y's. It shows that there is no
tendency for the high values of Y to be associated with
high values of X or for high values of Y to be associated
with low values of X. The two variables fluctuate in absolute
independence. In such a case the deviation of each point
from the fitted line is equal to its deviation from the mean
and the two root-mean-square deviations are equal.

Zero and unity are thus the limits to the value of r.
The values found in practical work fall somewhere between
these limits, approaching unity in cases where the degree
of relationship is high. The greater the value of r, the
greater the confidence that may be placed in the equation
as an expression of a relation which is approximated in a
high percentage of cases. In the example presented above,
dealing with motor vehicle registration and number of
taxable personal incomes, we have

$$r = \sqrt{1 - \frac{(105.3)^2}{(178.5)^2}} = .81.$$

There is a definite and fairly close connection
between the two variables for the states included in the
sample.

The coefficient of correlation may be made somewhat
more significant by prefixing to it the sign of the constant b in
the equation of relationship. This sign indicates whether
the slope of the line is positive or negative and, when
attached to r, enables us to tell whether the relationship
is direct or inverse. Thus in the present case high values
of one variable are paired with high values of the other.
The correlation is positive, and r should be
written + .81. When cotton production and prices are
correlated the relationship is an inverse one, for high values
of the one variable are associated with low values of
the other.

The measurement of relationship in a given case is completed
when we have secured the three measures described.
The equation of average relationship is an expression of
the underlying law connecting the two variables, if such a
law may be assumed. The standard error of estimate measures
the variation, in absolute terms, about the line of relationship.
The coefficient of correlation is an abstract measure
of the degree to which this relationship actually holds
in practice.
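
As a hedged illustration of the coefficient just described, the sketch below computes r from the standard error of estimate and the standard deviation of Y, and attaches the sign of the slope b. The function name is hypothetical, and the guard against a slightly negative value under the square root is an added precaution for rounded inputs, not part of the original formula.

```python
import math


def correlation_coefficient(s_y, sigma_y, b):
    """r = sqrt(1 - s_y^2 / sigma_y^2), carrying the sign of the slope b."""
    r_squared = 1.0 - (s_y ** 2) / (sigma_y ** 2)
    # Rounded inputs can push r_squared a hair below zero when s_y is
    # practically equal to sigma_y; treat that case as zero.
    magnitude = math.sqrt(max(r_squared, 0.0))
    return math.copysign(magnitude, b)


# Values from the motor vehicle example: S_y = 105.3, sigma_y = 178.5, b = 5.659.
r = correlation_coefficient(105.3, 178.5, 5.659)
print(round(r, 2))
```

The two limiting cases in the text follow directly: a standard error of zero gives a value of 1, and a standard error equal to the standard deviation gives a value of zero.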

DETAILS OF CALCULATION 

In the preceding section the attempt has been made to
explain the various measures necessary in studying the
relationship between variable quantities without introducing
a detailed explanation of procedure. We may now return
to a consideration of the details of calculation, including
certain methods by which this calculation may be reduced
to a minimum.

The procedure followed in the preceding illustration is a
logical one to employ in deriving the three required values.
This method is capable of general application, but the labor
involved may be materially reduced by taking advantage
of a short-cut method of deriving $S_y$. This method may be
first explained with reference to data of the type dealt
with above. And, for the present, the discussion will be
confined to cases in which the relationship between variables
may be described by a straight line.

The first step is the derivation of the equation of
relationship. A line of the type

$$Y = a + bX$$

is fitted by the method of least squares.

The next step is the computation of $S_y^2$, the square of the
standard error of estimate. This was done in the above illustration
by measuring the deviation of each individual observation
from the fitted line, and getting the mean-square of
these deviations. It may be shown* that this value can be
derived from the following equation:

$$S_y^2 = \frac{\Sigma(Y^2) - a\,\Sigma(Y) - b\,\Sigma(XY)}{N}$$

The quantities a and b are the constants in the equation to
the fitted straight line. The other values relate to the
original observations. Substituting in this equation a and b
and the other necessary values, taken from Table 86, we have†

$$S_y^2 = \frac{3{,}610{,}185 - (66.32136 \times 9{,}549) - (5.65925 \times 451{,}558)}{38} = 11{,}090$$
$$S_y = 105.3.$$

* The standard error of estimate is computed from the formula

$$S_y = \sqrt{\frac{\Sigma(d^2)}{N}}$$

where d represents a single deviation from the fitted line, or the difference
between the actual and the computed value of Y in a given case. The latter
is derived from the equation

$$Y_c = a + bX.$$

(The symbol $Y_c$ is used to represent the computed value of Y.)

If we let Y represent the actual value, we have, for each residual,

$$d = Y_c - Y$$

or

$$d = a + bX - Y. \qquad (1)$$

There will be as many equations of this type as there are points. Multiplying
each one by d, and adding, we have

$$\Sigma(d^2) = a\,\Sigma(d) + b\,\Sigma(dX) - \Sigma(dY). \qquad (2)$$

But, since the line was fitted by the method of least squares,

$$\Sigma(d) = 0$$
$$\Sigma(dX) = 0$$

(for proof of this see Appendix A)
and, therefore,

$$\Sigma(d^2) = -\Sigma(dY). \qquad (3)$$

Returning again to equation (1), we may multiply throughout by Y, and
add, securing

$$\Sigma(dY) = a\,\Sigma(Y) + b\,\Sigma(XY) - \Sigma(Y^2). \qquad (4)$$

Substituting the equivalent of $\Sigma(dY)$ in equation (3), we have

$$\Sigma(d^2) = \Sigma(Y^2) - a\,\Sigma(Y) - b\,\Sigma(XY) \qquad (5)$$

from which the given formula for $S_y^2$ is derived.

† For this calculation the values of a and b are given to a greater number of
decimal places than in the equation as first presented.
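
The short-cut formula lends itself to a direct check against the totals of Table 86. The sketch below assumes the constants a and b carried to five decimal places, as in the substitution above; the function name is hypothetical.

```python
import math


def standard_error_shortcut(n, sum_y, sum_xy, sum_y2, a, b):
    """Short-cut computation of the squared standard error of estimate:
    S_y^2 = (sum(Y^2) - a*sum(Y) - b*sum(XY)) / N.
    Returns the square and its root."""
    s_squared = (sum_y2 - a * sum_y - b * sum_xy) / n
    return s_squared, math.sqrt(s_squared)


# Totals from Table 86; a and b to five decimal places.
s_squared, s_y = standard_error_shortcut(
    n=38, sum_y=9549, sum_xy=451558, sum_y2=3610185,
    a=66.32136, b=5.65925,
)
print(round(s_squared), round(s_y, 1))
```

Because the numerator is a small difference between large quantities, the result is sensitive to the number of decimal places retained in a and b, which is why the constants are carried further here than in the fitted equation.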

From this point the procedure may follow that already
described, r being computed from the formula

$$r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}.$$

The coefficient r may be secured, however, without computing
S as an intermediate value. The above formula for
r may be reduced to

$$r^2 = \frac{a\,\Sigma(Y) + b\,\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

where $c_y$ is the difference between the mean Y and the
origin employed in the calculations.* If the origin is zero
on the original Y scale, $c_y$ will be equal to the arithmetic
mean of the Y's.

* The formula

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2}$$

may be written

$$r^2 = 1 - \frac{\Sigma(d^2)/N}{\Sigma(y^2)/N}$$

in which y refers to deviations from the arithmetic mean of the Y's. But

$$\frac{\Sigma(y^2)}{N} = \frac{\Sigma(Y^2)}{N} - c_y^2$$

where Y represents a deviation from an arbitrary origin (in this case zero on the
original scale) and $c_y$ represents the difference between this origin and the
mean of the Y's.

Therefore

$$r^2 = 1 - \frac{\Sigma(d^2)}{\Sigma(Y^2) - Nc_y^2}.$$

Substituting in this equation the equivalent of $\Sigma(d^2)$, as given in the
preceding footnote,

$$r^2 = 1 - \frac{\Sigma(Y^2) - a\,\Sigma(Y) - b\,\Sigma(XY)}{\Sigma(Y^2) - Nc_y^2}.$$

Simplifying,

$$r^2 = \frac{a\,\Sigma(Y) + b\,\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}.$$

In the present case, using the data of Table 86, we have

$$c_y = \frac{9{,}549}{38} = 251.289.$$

The other values are the same as those employed above in
computing $S_y$. Substituting in the formula, we have

$$r^2 = \frac{789{,}228.14}{1{,}210{,}630.86} = .6519$$
$$r = .81.$$


In effect, then, the labor of fitting a straight line by the
method of least squares gives us practically all the values
needed in securing S and r, the two other measures necessary
for a complete description of the relationship between two
variable quantities. The only additional values required
are $\Sigma(Y^2)$ and $c_y$.
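The point can be checked numerically. The following is a minimal sketch, not part of the original text, which assumes only the six sums quoted for Table 86 and uses Python variable names chosen here for illustration. It solves the normal equations and then applies the two formulas for $S_y$ and r given above.

```python
import math

# Sums quoted in the text for Table 86 (38 states).
N = 38
sum_x, sum_y = 1_242, 9_549
sum_x2, sum_y2, sum_xy = 65_236, 3_610_185, 451_558

# Solve the two normal equations for the slope b and the intercept a.
b = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / N

# Standard error of estimate: S_y^2 = [S(Y^2) - a*S(Y) - b*S(XY)] / N.
s_y = math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / N)

# Coefficient of correlation, computed without S_y as an intermediate value.
c_y = sum_y / N
r_squared = (a * sum_y + b * sum_xy - N * c_y ** 2) / (sum_y2 - N * c_y ** 2)
r = math.copysign(math.sqrt(r_squared), b)  # r takes the sign of b

print(round(a, 3), round(b, 3), round(s_y, 1), round(r, 3))
```

Because the sketch carries full floating-point precision, its results may differ in the last figure from the hand computation, which rounds $c_y$ before squaring.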


The Construction of a Correlation Table 

In the example presented above we had only thirty-eight 
observations. With a larger number it becomes practically 
impossible to retain the individual values in the study of 
relationships. These individual items must be grouped in 
significant classes, and all computations must be based 
upon these grouped data. This means, merely, that we must 
handle data organized in frequency distributions. Since 
we are dealing with two variables, however, the simple 
frequency table must be modified to meet the needs of 
the present problem. Such a modified frequency table, 
arranged to facilitate the computation of the values needed 
in studying relationship, is termed a correlation table. 

As a typical problem, involving the construction of such
a table, we may consider the relation between the discount
rates of Federal Reserve banks and the corresponding
discount rates of commercial banks. Since the paper discounted
by commercial banks may be rediscounted at the Federal
Reserve banks by the member banks, some degree of
relationship between the rates may be expected. Our present
object is the measurement of that relationship.

The first step is the tabulation of the original observations.
Monthly values of each variable¹ were secured for
each of the twelve Federal Reserve cities over a period of
150 months, from July, 1920, to December, 1932. In the
process of tabulation the items must be combined so that 
a Federal Reserve bank discount rate is paired with the 
corresponding rate charged by the commercial banks of the 
same city. Fig. 70 illustrates the method of tabulation. 

Tabulation having been completed, a correlation table 
designed to facilitate the later computations may be con- 
structed. Table 88 illustrates a suitable form. 

In Table 88, it will be noted, an arbitrary origin is employed
for each variable, and the class-interval unit is used
in the calculations. We here employ the symbols X' and
Y' to represent deviations from the arbitrary origin (which
is located at point 1.50, 3.50 on the original scales).

COMPUTATION OF MEASURES OF RELATIONSHIP

From this correlation table all the values needed in
fitting a straight line to the data, and in computing the
measures S and r, may be derived. The quantities $\Sigma(X')$,
$\Sigma(X'^2)$, $\Sigma(Y')$, and $\Sigma(Y'^2)$ are computed by methods already
familiar to the student. The product of the paired values
$\Sigma(X'Y')$ may be computed directly from the correlation
table, but it is perhaps simpler for the beginner to re-arrange
the data in columnar form, as in Table 89 on page 345.
When the figures are disposed in this way one line is employed
for each compartment of the original correlation
table in which items have been recorded.

¹ The discount rates of the Federal Reserve banks relate, for the first part
of the period covered, to trade acceptances; for later years they are "rates
for member banks on eligible paper." The commercial bank rates are those
charged on customers' prime commercial paper. The customary rate over
a given 30 day period was taken as of the middle of that period.

Table 89

Discount Rates of Federal Reserve Banks and Discount Rates of
Commercial Banks

(Computation of values required in curve fitting)

  X'     Y'      f      fX'Y'
   0      0      4          0
   1      0      2          0
   2      0      1          0
   4      0      9          0
   0      1      1          0
   1      1      9          9
   2      1     19         38
   3      1     19         57
   4      1     29        116
   2      2     16         64
   3      2     27        162
   4      2    122        976
   5      2     96        960
   6      2      3         36
   2      3      4         24
   3      3     25        225
   4      3    111      1,332
   5      3    196      2,940
   6      3     65      1,170
   7      3      1         21
   3      4      1         12
   4      4     90      1,440
   5      4    164      3,280
   6      4    175      4,200
   7      4     45      1,260
   4      5      9        180
   5      5     21        525
   6      5    146      4,380
   7      5    150      5,250
   8      5      8        320
   9      5     32      1,440
   6      6      4        144
   7      6     22        924
   8      6     10        480
   9      6     22      1,188
  10      6      1         60
  11      6      3        198
   7      7      5        245
   8      7      4        224
   9      7     63      3,969
  10      7      9        630
  11      7     36      2,772
   9      8      7        504
  10      8      9        720
  11      8      1         88
   9      9      1         81
  10      9      1         90
  11      9      2        198

  Total                42,932

The values required in fitting a straight line and in
computing the standard error and the coefficient of correlation
are:

N = 1,800              $\Sigma(X'^2)$ = 62,354
$\Sigma(X')$ = 10,054       $\Sigma(X'Y')$ = 42,932
$\Sigma(Y')$ = 6,904        $\Sigma(Y'^2)$ = 30,878.

The equation to the best fitting straight line is found to
be

$$Y' = -.10277 + .70509X'.$$

Substituting in the formula

$$S_y^2 = \frac{\Sigma(Y'^2) - a\Sigma(Y') - b\Sigma(X'Y')}{N}$$

we have

$$S_y^2 = \frac{30,878 - (-.10277 \times 6,904) - (.70509 \times 42,932)}{1,800} = .7314$$

$$S_y = .855.$$

To determine the value of the coefficient of correlation
we have only to substitute the proper values in the equation

$$r^2 = \frac{a\Sigma(Y') + b\Sigma(X'Y') - Nc_y^2}{\Sigma(Y'^2) - Nc_y^2}.$$

When this is done we have

$$r^2 = \frac{(-.10277 \times 6,904) + (.70509 \times 42,932) - (1,800 \times 14.71149)}{30,878 - (1,800 \times 14.71149)} = \frac{3,080.7178}{4,397.3180} = .70059$$

$$r = +.837.$$

All these calculations have been carried through in class-interval
units, with reference to an origin at point 1.50, 3.50
on the original scales. The value of r is not affected by this
fact, but the estimating equation and the standard error
of estimate should be corrected.
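The sums used in this computation can be rebuilt from the cell frequencies. The sketch below is illustrative only: the `cells` dictionary is a transcription of Table 89 (rows keyed by Y', columns by X', both in class-interval units), and the variable names are assumptions introduced here rather than notation from the text.

```python
import math

# Cell frequencies transcribed from Table 89: cells[Y'][X'] = f.
cells = {
    0: {0: 4, 1: 2, 2: 1, 4: 9},
    1: {0: 1, 1: 9, 2: 19, 3: 19, 4: 29},
    2: {2: 16, 3: 27, 4: 122, 5: 96, 6: 3},
    3: {2: 4, 3: 25, 4: 111, 5: 196, 6: 65, 7: 1},
    4: {3: 1, 4: 90, 5: 164, 6: 175, 7: 45},
    5: {4: 9, 5: 21, 6: 146, 7: 150, 8: 8, 9: 32},
    6: {6: 4, 7: 22, 8: 10, 9: 22, 10: 1, 11: 3},
    7: {7: 5, 8: 4, 9: 63, 10: 9, 11: 36},
    8: {9: 7, 10: 9, 11: 1},
    9: {9: 1, 10: 1, 11: 2},
}
triples = [(x, y, f) for y, row in cells.items() for x, f in row.items()]

# Frequency-weighted sums: one term per occupied compartment.
N = sum(f for _, _, f in triples)
sx = sum(f * x for x, _, f in triples)
sy = sum(f * y for _, y, f in triples)
sxx = sum(f * x * x for x, _, f in triples)
syy = sum(f * y * y for _, y, f in triples)
sxy = sum(f * x * y for x, y, f in triples)

# Least-squares line Y' = a + b X', then S_y and r in class-interval units.
b = (N * sxy - sx * sy) / (N * sxx - sx ** 2)
a = (sy - b * sx) / N
s_y = math.sqrt((syy - a * sy - b * sxy) / N)
c_y = sy / N
r = math.sqrt((a * sy + b * sxy - N * c_y ** 2) / (syy - N * c_y ** 2))

print(N, sx, sy, sxx, syy, sxy)
print(round(a, 5), round(b, 5), round(s_y, 3), round(r, 3))
```

Keeping one entry per occupied compartment mirrors the columnar arrangement of Table 89, so each product fX'Y' in that table corresponds to one term of `sxy`.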

The value of $S_y$, in class-interval units, is .855. Since
the class-interval of the Y-variable is .50, we have, in
original units,

$$S_y = .50 \times .855 = .4275.$$

The equation may be corrected in a similar fashion.
The class-interval being .50 both for X and Y, each unit
on the original scale equals two class-interval units. Thus
a range of 4 points on the original scale is equivalent to
a range of 8 points on the class-interval scale. For convenience
we may use Y" and X" to define deviations in original
units (i.e., deviations from the arbitrary origin), where we
have used Y' and X' to define corresponding deviations in
class-interval units. Then, for any stated deviation,
2Y" = Y'; similarly 2X" = X'. Retaining the values of
a and b in the equation of average relationship, and substituting
2Y" for Y' and 2X" for X', we have

$$2Y'' = -.10277 + .70509(2X'').$$

Simplifying this, we have

$$Y'' = -.05138 + .70509X''$$

which is the equation in terms of original units.

This equation refers to an origin whose coordinates
are 1.50 and 3.50 on the original scales. That is,
Y" = Y - 3.50, and X" = X - 1.50, where Y and X
define deviations, in original units, from the zero points
on the original scales. Making these substitutions we have

$$Y - 3.50 = -.05138 + .70509(X - 1.50).$$

Simplifying, and rounding off the constants by dropping
figures that are not significant, we have

$$Y = 2.391 + .705X.$$
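The two corrections (class-interval units to original units, arbitrary origin to true origin) can be combined in one step. The helper below is a sketch under that reading; the function name and its parameters are hypothetical and introduced only for illustration.

```python
def to_original_units(a, b, x_origin, y_origin, x_interval, y_interval):
    """Convert Y' = a + b*X' (class-interval units, arbitrary origin)
    into Y = intercept + slope*X on the original scales.

    Uses X' = (X - x_origin) / x_interval and Y' = (Y - y_origin) / y_interval.
    """
    slope = b * y_interval / x_interval
    intercept = y_origin + y_interval * a - slope * x_origin
    return intercept, slope


# Values from the text: origin at (1.50, 3.50), class-interval .50 for both.
intercept, slope = to_original_units(-0.10277, 0.70509, 1.50, 3.50, 0.50, 0.50)

# The standard error of estimate is rescaled by the Y class-interval alone.
s_y_original = 0.50 * 0.855

print(round(intercept, 3), round(slope, 3), s_y_original)
```

When the two class-intervals are equal, as here, the slope is unchanged by the conversion; only the constant term and the standard error are affected.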

We have now the three values required for determining
the relationship between Federal Reserve discount rates
and corresponding commercial bank rates, during the period
covered. The equation describes the average relationship,
the standard error of estimate serves as a measure of the
reliability of estimates based upon this equation, and the
coefficient of correlation serves as an abstract measure of
the degree of relationship between the two variables.

Fig. 71. Scatter Diagram of Federal Reserve and Commercial Bank
Rates, with Line of Average Relationship and Zones of Estimate
(Figure not reproduced. Horizontal axis: X, Federal Reserve bank rate, per cent.)

The significance of the standard error, $S_y$, is brought out
graphically in Fig. 71. The line of average relationship
has been drawn on this scatter diagram, and what may be
called "zones of estimate" have been marked out about
this line. Within the zone having a width equal to 2S,
centering at the fitted straight line, 68 per cent of all the
points should fall, on the assumption that the distribution
is normal. Within the zone having a width equal to 6S,
centering at the fitted straight line, 99.7 per cent of all
the points should fall, on the same assumption. The smaller
the value of S the narrower these zones would be, and hence
the more accurate would be the estimates which are based
upon the equation of average relationship.
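The percentages quoted for the zones follow from the normal curve, and the bounds of a zone at any given X follow from the fitted equation. The sketch below assumes the rounded constants of the equation in original units (2.391, .705) and $S_y$ = .4275; the function names are illustrative, not taken from the text.

```python
import math


def share_within(k):
    """Share of a normal distribution lying within k standard errors of the line."""
    return math.erf(k / math.sqrt(2))


def zone(x, k, intercept=2.391, slope=0.705, s_y=0.4275):
    """Lower and upper bounds, at X = x, of the zone of total width 2*k*S_y."""
    centre = intercept + slope * x
    return centre - k * s_y, centre + k * s_y


print(round(share_within(1), 4), round(share_within(3), 4))

# Zone of width 2S about the line at a Federal Reserve rate of 6 per cent.
low, high = zone(6.0, 1)
print(round(low, 4), round(high, 4))
```

The first call corresponds to the zone of width 2S (about 68 per cent) and the second to the zone of width 6S (about 99.7 per cent), both on the assumption of a normal distribution about the line.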

The Product-Moment Formula for the Coefficient
of Correlation

In the preceding examples the coefficient of correlation
has been computed from the formula

$$r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

which is based upon relations involved in fitting a straight
line by least squares. The usual formula differs somewhat
from this, and it is advisable that the student be familiar
with it.

When a straight line is fitted to data, the origin being
at the point of averages, the two normal equations

$$\Sigma(Y) = Na + b\Sigma(X)$$

$$\Sigma(XY) = a\Sigma(X) + b\Sigma(X^2)$$

become

$$\Sigma(y) = Na + b\Sigma(x)$$

$$\Sigma(xy) = a\Sigma(x) + b\Sigma(x^2)$$

where y and x measure deviations from the point of averages.
The first of these equations disappears and the second
reduces to

$$\Sigma(xy) = b\Sigma(x^2)$$

for

$$\Sigma(x) = 0 \text{ and } \Sigma(y) = 0.$$

The slope, b, is the only constant required, and this may be
computed from the relationship

$$b = \frac{\Sigma(xy)}{\Sigma(x^2)}.$$

Under the same conditions the formula

$$r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

reduces to

$$r^2 = \frac{b\Sigma(xy)}{\Sigma(y^2)}$$

for $c_y = 0$ when the deviations are measured from the mean
of the Y's. Substituting for b its equivalent, as just determined,
we have

$$r^2 = \frac{\Sigma(xy) \cdot \Sigma(xy)}{\Sigma(x^2) \cdot \Sigma(y^2)}.$$

But $\Sigma(y^2) = N\sigma_y^2$ and $\Sigma(x^2) = N\sigma_x^2$.

Therefore

$$r^2 = \frac{\Sigma(xy) \cdot \Sigma(xy)}{N\sigma_x^2 \cdot N\sigma_y^2}$$

and

$$r = \frac{\Sigma(xy)}{N\sigma_x\sigma_y}$$

in which x and y refer to deviations from an origin at the
point of averages.

This formula may be expressed

$$r = \frac{p}{\sigma_x\sigma_y}$$

in which

$$p = \frac{\Sigma(xy)}{N}.$$

The quantity p is the mean product of the paired values of
x and y.

The computation of the coefficient of correlation from 
this formula proceeds along lines somewhat different from 
those outlined above. As we have seen, both the arithmetic 
mean and the standard deviation may be readily computed 
by the selection of an arbitrary origin from which all 
deviations are measured, a later correction being made to 
offset the error involved in using this arbitrary origin. 
Similarly, the mean product p may be computed by a short 
method, requiring the use of assumed means and the applica- 
tion of a correction at the end of the process. 

If x' and y' represent deviations from points arbitrarily
selected as assumed means, while p' represents the mean
product of such deviations, then

$$p' = \frac{\Sigma(x'y')}{N}.$$

The computation of p' is not difficult, for deviations may
be measured from central points, and may be expressed in
class-interval units. Having p' we may secure the true
mean product from the formula

$$p = p' - c_xc_y$$

in which $c_x$ and $c_y$ represent the differences between the
true and assumed means of the x's and y's, respectively.¹

¹ The following is a proof of this relationship:

x' = deviation of any point from assumed mean of x's
x = deviation of same point from true mean of x's
$c_x$ = difference between true and assumed means of x's
y' = deviation of same point from assumed mean of y's
y = deviation of same point from true mean of y's
$c_y$ = difference between true and assumed means of y's

$$x' = x + c_x$$

$$y' = y + c_y$$

$$x'y' = (x + c_x)(y + c_y) = xy + c_xy + c_yx + c_xc_y.$$

For the sum of all such products for N points, we have

$$\Sigma(x'y') = \Sigma(xy) + c_x\Sigma(y) + c_y\Sigma(x) + Nc_xc_y.$$

But $\Sigma(y) = 0$ and $\Sigma(x) = 0$.

Therefore

$$\Sigma(x'y') = \Sigma(xy) + Nc_xc_y$$

$$\frac{\Sigma(xy)}{N} = \frac{\Sigma(x'y')}{N} - c_xc_y.$$

The Product-Moment Method, Ungrouped Data

This method may be illustrated with reference, first, to
ungrouped data, using the figures for personal incomes (X)
and passenger car registration (Y), by states. The values
required for this computation, as given in Table 86, are

N = 38
$\Sigma(X)$ = 1,242
$\Sigma(Y)$ = 9,549
$\Sigma(X^2)$ = 65,236
$\Sigma(Y^2)$ = 3,610,185
$\Sigma(XY)$ = 451,558.

The mean product may be computed from the formula

$$p = \frac{\Sigma(xy)}{N} = \frac{\Sigma(x'y')}{N} - c_xc_y.$$

We may select as arbitrary origin the actual origin on the
two original scales. Hence, the origin being at zero on the
original scales, we have

$$p = \frac{\Sigma(XY)}{N} - c_xc_y.$$

For the two standard deviations

$$\sigma_x = \sqrt{\frac{\Sigma(X^2)}{N} - c_x^2}, \qquad \sigma_y = \sqrt{\frac{\Sigma(Y^2)}{N} - c_y^2}.$$

These measures may be computed readily from the values
secured from Table 86:

$$c_x = \frac{1,242}{38} = 32.684, \qquad c_x^2 = 1,068.2439$$

$$c_y = \frac{9,549}{38} = 251.289, \qquad c_y^2 = 63,146.1615$$

$$p = \frac{451,558}{38} - 8,213.1297 = 3,670.0753$$

$$\sigma_x = \sqrt{\frac{65,236}{38} - 1,068.2439} = 25.47$$

$$\sigma_y = \sqrt{\frac{3,610,185}{38} - 63,146.1615} = 178.49$$

Solving for the coefficient of correlation,

$$r = \frac{p}{\sigma_x\sigma_y} = \frac{3,670.0753}{25.47 \times 178.49} = +.8073.$$
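The same steps may be traced in a few lines of code. This is a sketch assuming only the sums quoted from Table 86, with the zero points of the original scales taken as the assumed means; the variable names are chosen here for illustration.

```python
import math

# Sums quoted in the text for Table 86.
N = 38
sum_x, sum_y = 1_242, 9_549
sum_x2, sum_y2, sum_xy = 65_236, 3_610_185, 451_558

# With the origin at zero, c_x and c_y are simply the two arithmetic means.
c_x, c_y = sum_x / N, sum_y / N

# Mean product and standard deviations, each corrected for the origin.
p = sum_xy / N - c_x * c_y
sigma_x = math.sqrt(sum_x2 / N - c_x ** 2)
sigma_y = math.sqrt(sum_y2 / N - c_y ** 2)

# Product-moment coefficient of correlation.
r = p / (sigma_x * sigma_y)

print(round(sigma_x, 2), round(sigma_y, 2), round(r, 3))
```

The sketch does not round $c_x$ and $c_y$ before multiplying, so the intermediate value of p may differ slightly from the hand computation; the standard deviations and r agree to the figures quoted.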


The equation to the straight line which describes the
average relationship between X and Y may be derived
from the values required for the preceding calculations.
When the origin is at the point of averages this equation
may be written

$$y = r\frac{\sigma_y}{\sigma_x}x.$$

Substituting the proper values, we have

$$y = +.8073 \times \frac{178.49}{25.47}x = 5.657x.$$

This, with an insignificant difference, is the equation secured
by the method of least squares. The constant term representing
the y-intercept disappears, since the origin is at
the point of averages, through which the least squares line
must pass.¹

When the product-moment method is employed in computing
the coefficient of correlation and in determining the
equation of regression, the standard error, $S_y$, may be
derived by a simple change in the formula first presented
for r. From the expression

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2}$$

we may secure the formula

$$S_y = \sigma_y\sqrt{1 - r^2}$$

which enables us to compute $S_y$, if we have the values of
$\sigma_y$ and r. In the present case,

$$S_y = 178.49\sqrt{1 - .8073^2} = 105.3.$$
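As a quick arithmetic check, the slope and the standard error of estimate can both be obtained from r and the two standard deviations. The sketch below assumes the rounded values quoted in the text; the variable names are illustrative.

```python
import math

# Rounded values from the product-moment computation for Table 86.
r, sigma_x, sigma_y = 0.8073, 25.47, 178.49

# Slope of the line of average relationship, origin at the point of averages.
slope = r * sigma_y / sigma_x

# Standard error of estimate from S_y = sigma_y * sqrt(1 - r^2).
s_y = sigma_y * math.sqrt(1 - r ** 2)

print(round(slope, 3), round(s_y, 1))
```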

The Product-Moment Method, Classified Data

The product-moment method is also applicable to cases
in which it is necessary to construct a double frequency or
correlation table. The procedure is shown in detail in
Table 90.

¹ That the formula $y = r\frac{\sigma_y}{\sigma_x}x$ is equivalent to the formula based upon the
method of least squares may be readily demonstrated. When the line passes
through the point of averages, the equation, Y = a + bX, becomes y = bx.
But $b = \frac{\Sigma(xy)}{\Sigma(x^2)}$. We may write, accordingly,

$$y_c = \frac{\Sigma(xy)}{\Sigma(x^2)}x.$$

This is equivalent to

$$y_c = r\frac{\sigma_y}{\sigma_x}x$$

for the latter may be written

$$(1) \quad y_c = \frac{\Sigma(xy)}{N\sigma_x\sigma_y} \cdot \frac{\sigma_y}{\sigma_x}x$$

$$(2) \quad y_c = \frac{\Sigma(xy)}{N\sigma_x^2}x$$

$$(3) \quad y_c = \frac{\Sigma(xy)}{N \cdot \frac{\Sigma(x^2)}{N}}x$$

$$(4) \quad y_c = \frac{\Sigma(xy)}{\Sigma(x^2)}x.$$

(The symbol $y_c$ is employed for the computed value of y, in these equations,
to avoid confusion with the actual y's which appear in the right-hand members
of the equations.)

This table is identical with that previously presented for
the same data, except that a different arbitrary origin has
been selected.

The value 4.50 is adopted as the assumed mean of the
X's ($M'_x$), and the value 5.50 as the assumed mean of
the Y's ($M'_y$). Deviations are measured in class-interval
units from this origin. In each compartment of the correlation
table there are three figures, involved in the computation
of $\Sigma(x'y')$. The figure in the center indicates the number
of items falling in that compartment. Thus there are
seven pairs having X values between 5.75 and 6.25 (mid-point
6.0) and Y values between 7.25 and 7.75 (mid-point
7.5). For each of these pairs x' (the deviation from the
assumed mean of the X's) is + 3, in class-interval units,
and y' (the deviation from the assumed mean of the Y's)
is + 4, in class-interval units. For each pair, therefore,
x'y' = + 12. This figure appears at the top of the compartment.
But there are seven pairs in this compartment, so
the sum of x'y' for this group is + 84. This figure appears
in parentheses at the bottom of the compartment. To
secure $\Sigma(x'y')$ for the entire table it is necessary to add
algebraically the values secured in this way for all compartments.
The addition is first carried out for the different
rows, the sub-totals being given in the column at the right
of the table. It is found that $\Sigma(x'y')$ = + 4,492, in class-interval
units.

Details of the computation of the coefficient of correlation
are given in Table 91 on page 358. The values of $c_x$ and $c_y$
are obtained by methods already familiar.

We have, from that table,

$$p = \frac{\Sigma(x'y')}{N} - c_xc_y = \frac{4,492}{1,800} - (-.414 \times -.164) = +2.4277.$$

This is the value of p, the mean product, in class-interval
units. Proceeding,

$$r = \frac{\Sigma(xy)}{N\sigma_x\sigma_y} = \frac{p}{\sigma_x\sigma_y} = +.837.$$

In computing r, both the numerator and denominator of
the final fraction (the mean product and the two standard
deviations) are in class-interval units. Since this is true,
r may be computed directly without reducing the figures
to the original units. The entire operation, therefore, is
carried on in simple class-interval units.

Table 91

Calculation of the Coefficient of Correlation between the Discount Rates
of Commercial Banks and of Federal Reserve Banks
(Calculations based on the entries in Table 90)

M'_x = 4.50                                  M'_y = 5.50

c_x = -746 / 1,800 = -.414                   c_y = -296 / 1,800 = -.164

c_x^2 = (-.414)^2 = .171                     c_y^2 = (-.164)^2 = .027

s_x^2 = 6,506 / 1,800 = 3.614                s_y^2 = 4,446 / 1,800 = 2.470

sigma_x^2 = s_x^2 - c_x^2                    sigma_y^2 = s_y^2 - c_y^2
          = 3.614 - .171                               = 2.470 - .027
          = 3.443                                      = 2.443

sigma_x = 1.855                              sigma_y = 1.563

M_x = 4.50 - .5(.414)                        M_y = 5.50 - .5(.164)
    = 4.293                                      = 5.418

p = S(x'y') / N - c_x c_y
  = 4,492 / 1,800 - (-.414 x -.164)
  = 2.4956 - .0679
  = + 2.4277

r = p / (sigma_x sigma_y)
  = + 2.4277 / ((1.855)(1.563))
  = + 2.4277 / 2.8994
r = + .837

Note: The class-interval unit has been employed in all the computations
shown in this table.
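The short method with assumed means can be followed step by step in code. The sketch below is illustrative: the `cells` dictionary repeats the frequencies of Table 89 (keyed by Y' and then X', measured from the origin 1.50, 3.50), and the assumed means 4.50 and 5.50 correspond to X' = 6 and Y' = 4 on that scale. All names are assumptions made here.

```python
import math

# Cell frequencies transcribed from Table 89: cells[Y'][X'] = f.
cells = {
    0: {0: 4, 1: 2, 2: 1, 4: 9},
    1: {0: 1, 1: 9, 2: 19, 3: 19, 4: 29},
    2: {2: 16, 3: 27, 4: 122, 5: 96, 6: 3},
    3: {2: 4, 3: 25, 4: 111, 5: 196, 6: 65, 7: 1},
    4: {3: 1, 4: 90, 5: 164, 6: 175, 7: 45},
    5: {4: 9, 5: 21, 6: 146, 7: 150, 8: 8, 9: 32},
    6: {6: 4, 7: 22, 8: 10, 9: 22, 10: 1, 11: 3},
    7: {7: 5, 8: 4, 9: 63, 10: 9, 11: 36},
    8: {9: 7, 10: 9, 11: 1},
    9: {9: 1, 10: 1, 11: 2},
}

# Deviations x', y' from the assumed means, in class-interval units.
devs = [(x - 6, y - 4, f) for y, row in cells.items() for x, f in row.items()]
N = sum(f for _, _, f in devs)
sum_dx = sum(f * dx for dx, _, f in devs)
sum_dy = sum(f * dy for _, dy, f in devs)
sum_dxdy = sum(f * dx * dy for dx, dy, f in devs)

# Corrections for the assumed means, then the true mean product.
c_x, c_y = sum_dx / N, sum_dy / N
p = sum_dxdy / N - c_x * c_y

# Standard deviations in class-interval units, and r.
sigma_x = math.sqrt(sum(f * dx * dx for dx, _, f in devs) / N - c_x ** 2)
sigma_y = math.sqrt(sum(f * dy * dy for _, dy, f in devs) / N - c_y ** 2)
r = p / (sigma_x * sigma_y)

# True means on the original scales (class-interval .50).
mean_x = 4.50 + 0.50 * c_x
mean_y = 5.50 + 0.50 * c_y

print(sum_dx, sum_dy, sum_dxdy, round(r, 3), round(mean_x, 3), round(mean_y, 3))
```

Because r is a ratio of quantities all expressed in class-interval units, no conversion to original units is needed before dividing.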


In deriving the equation to the straight line which
describes the average relationship between x and y from
the formula

$$y = r\frac{\sigma_y}{\sigma_x}x,$$

$\sigma_y$ and $\sigma_x$ should be expressed in units of the original scales.¹
This is done by multiplying the present values by the
class-intervals.

$\sigma_x$ (in original units) = 1.855 × .50 = .9275
$\sigma_y$ (in original units) = 1.563 × .50 = .7815.

Substituting the given values in the formula, we have

$$y = .837 \times \frac{.7815}{.9275}x = .705x.$$

¹ When the class-intervals happen to be the same, as in the present case,
the change is not necessary, as the relation between numerator and denominator
is not altered. In practice it is advisable always to express the two
standard deviations in original units at this stage of the calculations.

The Lines of Regression

In the above discussion certain terms ordinarily employed
in the treatment of correlation have been purposely omitted.
Several of these should be explained.

The equation to the line of best fit in the preceding
illustration was found to be

$$y = .705x$$

when the origin was taken at the point of averages. In
this equation y is expressed as a function of x; that is, x is
taken to be the independent variable and y the dependent
variable. The equation expresses the average variation in
y (discount rates of commercial banks) corresponding to a
change of one unit in x (discount rates of Federal Reserve
Banks). This line of relationship corresponds precisely to
a line of trend, which describes the average change in a
given series accompanying a unit change in time. A line
which thus describes the average relationship between two
variables is termed a line of regression. Its equation is
termed a regression equation, and the quantity $r\frac{\sigma_y}{\sigma_x}$ which
gives the slope of such a line is called a coefficient of regression.
The use of these terms dates back to early studies by
Galton, dealing with the relation between the heights of
fathers and the heights of sons. Sons, Galton found, deviated
less on the average from the mean heights of the race than
their fathers. Whether the fathers were above or below
the average, the sons tended to go back or regress towards
the mean. He therefore termed the line which graphically
described the average relationship between these two variables
the line of regression. The term is now used generally,
as indicated above, though the original meaning has no
significance in most of its applications.

In any given case equations to two lines of regression
may be computed. One is an expression of the average
relationship between a dependent Y-variable and an independent
X-variable; the other describes the relationship
between a dependent X-variable and an independent
Y-variable. The significance of the two may be indicated
graphically.

Figure 72 is derived directly from the scatter diagram
presented in Fig. 71. The circle in each column represents
the mean Y-value of all the items falling in that column.
Thus in the third column there are 40 cases including all
those with X-values falling between 2.25 per cent and
2.75 per cent. The Y-values vary, however, being distributed
as shown in Table 92.


Table 92

Computation of the Arithmetic Mean of an Array

Class-interval     Mid-point     Frequency
                       m             f            fm
4.75 - 5.24           5.0            4           20.0
4.25 - 4.74           4.5           16           72.0
3.75 - 4.24           4.0           19           76.0
3.25 - 3.74           3.5            1            3.5

                                    40          171.5

$$M = \frac{171.5}{40} = 4.2875.$$
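The mean of an array is a frequency-weighted average of the class mid-points. The sketch below reproduces the arithmetic of Table 92; the list name is an assumption made for illustration.

```python
# (mid-point m, frequency f) for the third column, as given in Table 92.
array = [(5.0, 4), (4.5, 16), (4.0, 19), (3.5, 1)]

total_f = sum(f for _, f in array)
total_fm = sum(m * f for m, f in array)
mean = total_fm / total_f

print(total_f, total_fm, mean)
```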

Similar mean values are obtained for the other columns.
These are plotted in Fig. 72, together with the line of
regression of Y on X.

In Fig. 72 the X-variable (Federal Reserve bank discount
rates) is independent. As it increases from 4.0 per cent
to 4.5, 5.0, 5.5 per cent, and so on, the average of commercial
bank rates increases also. An average commercial
bank rate of 4.29 per cent was associated with an average
Federal Reserve bank rate of 2.5 per cent; an average
commercial bank rate of 4.56 per cent was associated with
an average Federal Reserve bank rate of 3.0 per cent,
and so on. (The commercial bank rates cited are the means
of the entries in the columns referred to.) The slope of
the straight line, which is the line of regression or the line
of average relationship, measures the average increase in
commercial bank rates corresponding to a unit increase in
Federal Reserve bank rates.

Fig. 72. Showing the Relation between Discount Rates of Commercial
Banks and Federal Reserve Bank Discount Rates. (The broken line
connects the means of the columns and the straight line shows the average
change in commercial bank rates corresponding to a unit change in Federal
Reserve bank rates; i.e., it represents the regression of Y on X)
(Figure not reproduced. Horizontal axis: Federal Reserve bank rates, per cent,
with column limits from 1.25 to 7.25. Means of columns: 3.60, 3.90, 4.28,
4.56, 4.86, 5.11, 5.60, 5.96, 6.40, 6.69, 7.25, 7.02.)

It is possible to view the relationship between these two
variables in another light. These questions arise: Given
a certain commercial bank discount rate, what is the average
Federal Reserve bank rate associated with it? And for a
given change in commercial bank discount rates, what is
the average change in the corresponding Federal Reserve
bank rates? The commercial bank rate is now looked upon
as independent, and the Federal Reserve rate as an associated
dependent variable. These questions are answered by
Fig. 73. The points marked by the small circles and connected
by the broken line show the locations of the arithmetic
means of the items falling in the various rows. Thus the
16 X-items in the bottom row have an average value of
2.75 per cent. This is the average Federal Reserve bank
discount rate associated with a commercial bank rate of
3.5 per cent. The average Federal Reserve bank rate
associated with a commercial bank rate of 4.0 per cent is
2.93 per cent, and so on. The straight line fitted to these
points indicates the relationship between the two, its slope
measuring the average increase (or decrease) in Federal
Reserve bank rates associated with a unit change in commercial
bank rates.

Fig. 73. Showing the Relation between Federal Reserve Bank Discount
Rates and the Discount Rates of Commercial Banks. (The broken line
connects the means of the rows and the straight line shows the average
change in Federal Reserve bank rates corresponding to a unit change in
commercial bank rates; i.e., it represents the regression of X on Y)
(Figure not reproduced. Horizontal axis: Federal Reserve bank rates, per cent.
Means of rows, from the top row to the bottom row: 6.63, 6.32, 6.29, 5.52,
4.80, 4.18, 3.87, 3.58, 2.93, 2.75.)

This is the line of regression of X on Y. The general
formula for the equation to this line is:

$$x = r\frac{\sigma_x}{\sigma_y}y.$$

Substituting the present values, we have

$$x = .837 \times \frac{.9275}{.7815}y$$

or

$$x = .993y.$$

The factors in this equation, it will be seen, are the same as
those entering into the formula for the line of regression of
y on x.¹ If r is equal to 1 the two lines coincide, and if,
in addition, the two standard deviations are equal, the line
of regression will bisect the angle formed by the axes.
If the points be plotted on a chart scaled in units of the
standard deviations, we have y = rx; the slope of the line
of regression is then equal to the value of r.

The coefficient of regression is represented by the symbol
b. In a simple correlation problem there are two such
coefficients, representing the slopes of the two lines of
regression. These are

$$b_{yx} = r\frac{\sigma_y}{\sigma_x}, \qquad b_{xy} = r\frac{\sigma_x}{\sigma_y}.$$

The subscripts indicate the relation between the two variables.
The first subscript refers to the dependent variable in each
case.

The coefficient r appears in both formulas. This being
so, it is clear that r may be computed from the regression
coefficients. For

$$r = \sqrt{b_{yx} \cdot b_{xy}}.$$

Thus if we know the slopes of the two lines of regression
r may be determined. In the present example

$$r = \sqrt{.705 \times .993} = .837.$$

¹ The formula

$$x = r\frac{\sigma_x}{\sigma_y}y$$

may be reduced to

$$x = \frac{\Sigma(xy)}{\Sigma(y^2)}y.$$

This is the equation to a line fitted to the points plotted in Fig. 73 in such a
way that the sum of the squares of the horizontal deviations is a minimum.
The formula

$$y = \frac{\Sigma(xy)}{\Sigma(x^2)}x$$

is the equation to the line for which the sum of the squares of the vertical
deviations is a minimum. An understanding of this point may make clear
the difference between the two lines of regression.
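The relation between r and the two slopes can be verified directly. The sketch below assumes the standard deviations in original units (.9275 and .7815) and r = .837 from the preceding computation; the names `b_yx` and `b_xy` are labels chosen here for the two regression coefficients.

```python
import math

r, sigma_x, sigma_y = 0.837, 0.9275, 0.7815

# Slope of the regression of y on x, and of x on y.
b_yx = r * sigma_y / sigma_x
b_xy = r * sigma_x / sigma_y

# The product of the two slopes is r squared, so r is their geometric mean.
r_recovered = math.sqrt(b_yx * b_xy)

print(round(b_yx, 3), round(b_xy, 3), round(r_recovered, 3))
```

The ratio of the standard deviations cancels in the product, which is why r can be recovered from the two slopes alone.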

USE OF THE EQUATIONS OF REGRESSION

The two equations of regression given above

$$y = .705x$$

$$x = .993y$$

describe relations between deviations from the respective
arithmetic means. That is, the origin is at the point of
averages, and to use the equations we cannot use the
original values of X and Y but must express them as deviations
from their means. For example, we wish to determine
the normal commercial bank rate associated with a Federal
Reserve bank rate of 6 per cent. The mean value of the X-variable
(Federal Reserve bank rates) is 4.293 per cent. A
rate of 6 per cent thus represents a deviation from the mean
of + 1.707. Substituting this value in the first of the
above equations, we have

$$y = .705 \times (+1.707) = +1.203.$$

This is the average y-deviation associated with an x-deviation
of + 1.707. To get the normal commercial bank rate
associated with a Federal Reserve rate of 6 per cent the
quantity + 1.203 per cent must be added to the mean
commercial bank rate, 5.418 per cent. The value we wish
is thus 6.621 per cent.

This calculation has been rather round-about because
of the form of the equation of relationship. This equation
can be put in more appropriate form for such computations.
Let

$\bar{X}$ = arithmetic mean of the X's
$\bar{Y}$ = arithmetic mean of the Y's.

Then

$$y = r\frac{\sigma_y}{\sigma_x}x$$

may be written

$$Y - \bar{Y} = r\frac{\sigma_y}{\sigma_x}(X - \bar{X}).$$

In this last equation X and Y represent the values of the
variables on the original scales, and not as deviations from
their respective means. In terms of the coordinate chart, it
means shifting the origin from the point of averages to a
point corresponding to zero on each of the original scales.

To illustrate the greater utility of the equation in this
form, the equation

$$y = .705x$$

may be changed in the manner indicated. It becomes

$$Y - 5.418 = .705(X - 4.293) = .705X - 3.027$$

$$Y = 2.391 + .705X.$$

This is the equation with the origin so shifted that the
original values may be employed directly. To determine
the commercial bank rate normally associated with a Federal
Reserve rate of 6 per cent we may substitute the latter
value in the equation just secured.

$$Y = 2.391 + (.705 \times 6.0) = 6.621.$$

Precisely the same results are secured as with the equation
in the other form, but for many purposes it is preferable
to have an equation in which the actual values may be
inserted.

The equation

$$x = r\frac{\sigma_x}{\sigma_y}y$$

may be similarly changed to

$$X - \bar{X} = r\frac{\sigma_x}{\sigma_y}(Y - \bar{Y}).$$
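The two forms of the regression equation can be compared in a short sketch. It assumes the rounded constants used in the text (slope .705, means 4.293 and 5.418, intercept 2.391); the function names are hypothetical.

```python
def estimate_from_deviation(x, slope=0.705, mean_x=4.293, mean_y=5.418):
    """Estimate Y from the form y = b*x, with the origin at the point of averages."""
    return mean_y + slope * (x - mean_x)


def estimate_from_original(x, intercept=2.391, slope=0.705):
    """Estimate Y from the form Y = a + b*X on the original scales."""
    return intercept + slope * x


fed_rate = 6.0
by_deviation = estimate_from_deviation(fed_rate)
by_original = estimate_from_original(fed_rate)

print(round(by_deviation, 3), round(by_original, 3))
```

Any difference between the two results arises only from rounding the intercept to three decimal places.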


SuMMAEY OP COEEELATION PROCEDURE 

In the foregoing pages there have been presented two 
quite different methods of securing the values required in 
measuring the relationship between two variables. The 
steps in the two methods may be briefly summarized. The 
method of least squares is basic in both cases, but that term 
^ employed to describe the first method 

fundam1»'n!«l ^ ® ^ 

lunaamental step m that procedure. 

I- The Least Squares Method. 

A. Data to be handled as individual items. 

s^quaref'1^5™ f ^ 

Su npw ^gement of the data in columns 

vir the required 

Hon u ^(XY). The equa- 

betweertfaftwo^ariables”*""' relationship 
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2. Compute the standard error of estimate, from the 
formula ' 

O 3 _ - <^2(F) -- fe2(ZF) 

N 

Sy is a measure of the reliability of estimates based 
upon the equation of relationship, and is to be inter- 
preted in the same way as is the standard deviation 
about an arithmetic mean. 

3. Compute the coefficient of correlation j r, from, the for- 
mula 



, _ aE(Y) + b2(XY) - Ncy^ 

\ S(F2)-iVc„2 

Give r the sign of the constant 6 in the equation of regres- 
sion. This coefficient is an abstract measure of the degree 
of relationship between the two variables, in so far as this 
relationship may be described by a straight line. 

4. If an equation describing the regression of X on Y
(X being dependent) is desired, the proper values may
be substituted in the two normal equations

$$\Sigma(X) = Na + b\Sigma(Y)$$

$$\Sigma(XY) = a\Sigma(Y) + b\Sigma(Y^2).$$

The equation secured will be of the type

$$X = a + bY.$$

The standard error of estimate, $S_x$, may be computed by
making the appropriate changes in the formula as given
for $S_y$. The value of r will be the same as in the
preceding case, in which Y is dependent.

B. Data to be classified.

1. Select an appropriate class-interval, and tabulate the
items in the form of a correlation table.

2. Compute the necessary values for fitting a straight line
to the data. In doing so, an arbitrary origin may be
selected for each variable, and all values expressed in
class-interval units. A re-arrangement in columnar form
may facilitate the computation of the quantity
$\Sigma(X'Y')$.

3. Compute the standard error of estimate, employing the
formula given above.

4. Compute the coefficient of correlation from the formula
given above.

5. If the above calculations were carried on in class-interval
units, the equation of average relationship and the
standard error of estimate should now be expressed in terms of
the original units of measurement. If an arbitrary origin
was employed, the equation should be corrected so that
the variables relate to deviations from the true origin.

II. The Product-Moment Method.

A. Data to be handled as individual items.

1. Arrange the paired observations in parallel columns and
compute the quantities $\Sigma(X)$, $\Sigma(Y)$, $\Sigma(X^2)$,
$\Sigma(Y^2)$, $\Sigma(XY)$.

2. Divide these quantities throughout by N. For the first
two of these quotients we may use the symbols $c_x$ and
$c_y$ (i.e., $\Sigma(X)/N = c_x$ and $\Sigma(Y)/N = c_y$).

3. Compute the mean product from the formula

$$p = \frac{\Sigma(XY)}{N} - c_x c_y$$

4. Compute the two standard deviations from the formulas

$$\sigma_x = \sqrt{\frac{\Sigma(X^2)}{N} - c_x^2}$$

$$\sigma_y = \sqrt{\frac{\Sigma(Y^2)}{N} - c_y^2}$$

5. Compute the coefficient of correlation from the formula

$$r = \frac{p}{\sigma_x \sigma_y}$$

6. Determine the equations of regression by substituting
the proper values in the formulas

$$y = r\frac{\sigma_y}{\sigma_x}\,x$$

$$x = r\frac{\sigma_x}{\sigma_y}\,y.$$

(Note: For each of these equations the origin is at the
point of averages.)

7. If desired, transfer the origin to zero on the two original
scales by substituting the arithmetic means in the
equations

$$Y - \bar{Y} = r\frac{\sigma_y}{\sigma_x}(X - \bar{X})$$

$$X - \bar{X} = r\frac{\sigma_x}{\sigma_y}(Y - \bar{Y}).$$

8. Compute the two standard errors of estimate from the
formulas

$$S_y = \sigma_y\sqrt{1 - r^2}$$

$$S_x = \sigma_x\sqrt{1 - r^2}$$

B. Data to be classified.

1. Construct a correlation table as in I. B. above.

2. Select an assumed mean for each variable. Measure the
deviations of the various items from the assumed means
in class-interval units.

3. Compute $c_x$ and $c_y$ in class-interval units.

4. Compute $\sigma_x$ and $\sigma_y$ in class-interval units.

5. Compute $\Sigma(x'y')$ in class-interval units for each
compartment of the correlation table. Total these figures to get
$\Sigma(x'y')$ for the whole table.

6. Determine the value of the mean product in class-interval
units from the formula

$$p = \frac{\Sigma(x'y')}{N} - c_x c_y$$

7. Compute r from the formula

$$r = \frac{p}{\sigma_x \sigma_y}$$

8. Reduce $\sigma_x$ and $\sigma_y$ to original units.

9. Determine the equations of regression by substituting
the proper values in the formulas

$$y = r\frac{\sigma_y}{\sigma_x}\,x$$

and

$$x = r\frac{\sigma_x}{\sigma_y}\,y.$$

10. If desired, transfer the origin to zero on the two original
scales from the formulas

$$Y - \bar{Y} = r\frac{\sigma_y}{\sigma_x}(X - \bar{X})$$

$$X - \bar{X} = r\frac{\sigma_x}{\sigma_y}(Y - \bar{Y}).$$

11. Compute the two standard errors of estimate from the
formulas

$$S_y = \sigma_y\sqrt{1 - r^2}$$

$$S_x = \sigma_x\sqrt{1 - r^2}$$

It is advisable, in all cases, to construct scatter diagrams
and to plot the lines of regression thereon. It is generally
possible to derive from such diagrams a truer idea of the
relations involved, and of the adequacy of the methods
employed, than may be obtained from a study of the figures
alone.
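
The agreement of the two procedures summarized above may be checked on a small case. The following is a minimal sketch in Python, not part of the original procedure: the five paired observations are hypothetical, the function names are illustrative only, and the normal equations for the regression of Y on X are assumed to take the usual form $\Sigma(Y) = Na + b\Sigma(X)$, $\Sigma(XY) = a\Sigma(X) + b\Sigma(X^2)$.

```python
from math import sqrt


def least_squares_method(xs, ys):
    """Method I: fit Y = a + bX, then derive S_y and r from the sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Solve the two normal equations for b and a.
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    c_y = sy / n
    # Standard error of estimate: S_y^2 = [S(Y^2) - aS(Y) - bS(XY)] / N
    s_y = sqrt((syy - a * sy - b * sxy) / n)
    # Coefficient of correlation, given the sign of b.
    r = sqrt((a * sy + b * sxy - n * c_y ** 2) / (syy - n * c_y ** 2))
    if b < 0:
        r = -r
    return a, b, s_y, r


def product_moment_method(xs, ys):
    """Method II: mean product divided by the two standard deviations."""
    n = len(xs)
    c_x, c_y = sum(xs) / n, sum(ys) / n
    p = sum(x * y for x, y in zip(xs, ys)) / n - c_x * c_y
    sigma_x = sqrt(sum(x * x for x in xs) / n - c_x ** 2)
    sigma_y = sqrt(sum(y * y for y in ys) / n - c_y ** 2)
    r = p / (sigma_x * sigma_y)
    s_y = sigma_y * sqrt(1 - r * r)
    return r, sigma_x, sigma_y, s_y


# Hypothetical observations, chosen only to keep the arithmetic short.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

a, b, s_y_ls, r_ls = least_squares_method(X, Y)
r_pm, sigma_x, sigma_y, s_y_pm = product_moment_method(X, Y)
print(a, b, s_y_ls, r_ls)
print(r_pm, s_y_pm)
```

Both routes should yield the same r and the same standard error of estimate, apart from rounding in the last decimal places.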


LIMITATIONS 

A question naturally arises as to the degree of generality
attaching to the measures of relationship described in the
preceding pages. Are they limited to certain types of
distributions, or may they be employed as absolutely general
and universally valid measures?

As we have seen, the standard deviation has a precise and
definite meaning with respect to distributions following the
normal law. Having values of the mean and of the standard
deviation, we know, in such cases, the exact percentage
of observations which will fall within any stated limits.
If the distribution departs from the normal type the standard
deviation is still a useful measure, but it cannot be
interpreted in the same exact sense. Bearing this in mind, the
formula

$$r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

may be considered.

When the distribution of the original values of the
dependent variable about their mean is normal and the
distribution about the least squares line is normal, both
$S_y$ and $\sigma_y$ have specific and exact meanings, and it is
perfectly legitimate to compute such a measure as r, based
upon the relation of one to the other. Departures from
normality in either case reduce the significance of this 
comparison. But we have seen that the standard deviation 
remains a useful measure even though the departure from 
the normal type be fairly pronounced, though in the latter 
case it lacks the precise significance attaching to it in a 
normal distribution. In the same way the standard error of 
estimate and the coefficient of correlation may be computed 
and utilized, even when all the requirements of normality are 
not met. Care must be taken in their interpretation in 
such cases, however. It must be clearly recognized that 
these measures have their full significance only in cases 
where the original distribution of the dependent variable 
and the distribution about the least squares line are both 
normal, or approximately so. 

A simple example may make clear the effect upon the
value of the coefficient of correlation of an extreme departure
from a normal distribution. In the following table are
listed certain selected figures taken from the 1919 Census
of Manufactures, for the State of New York.

                          Table 93

Wage-Earners in Factories and Value of Products, 1919, in Eleven
                Cities in the State of New York

                     Number of wage-      Total value of
     City            earners (in          products (in
                     thousands)           millions of dollars)
                         (X)                   (Y)

Batavia                   2.2                    9
Beacon                    2.2                   10
Corning                   3.5                   11
Geneva                    2.5                   10
Glens Falls               2.8                   12
Ithaca                    1.7                   10
Middletown                2.2                   10
Peekskill                 2.1                   11
Rensselaer                1.4                   10
Tonawanda                 1.8                   16
New York City           638.8                5,261


When the first ten of these cities, in the order listed, are 
treated as a group, the following values are secured: 

$\sigma_y$ = 1.8682

$S_y$ = 1.8669
r = -.034.

(No general significance is to be attached to the above 
coefficient of correlation, for the cities were selected for the 
purpose of illustrating a particular point.) 

The ten points and the line of regression are plotted in 
Fig. 74. 

When New York City is included in the group, the values 
secured for the sample of eleven cities are 

$\sigma_y$ = 1509.3
$S_y$ = 7.53
r = +.999988.

The eleven points and the line of regression are plotted in 
Fig. 75. 

The reason for the markedly different results is obvious.
When the one very large city is included with the ten small
cities the standard deviations of both variables are greatly
increased. That of the Y-variable (value of products) is
increased from 1.8682 to 1509.3. But $S_y$, the measure of
the scatter about the fitted line, undergoes no such
pronounced change in value. For the ten cities it is 1.8669;
for the eleven cities 7.53. This is due to the fact that the
one exceptional case is given such a great weight, in fitting
by the method of least squares, that the fitted line must
pass through or very near the point representing this
observation. Accordingly, S is always affected less than
$\sigma$ by a single very exceptional case. Since the value of r
depends upon the relationship

$$r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

the presence of such a case always tends to increase the
value of the measure of correlation. The introduction of
the one exceptional case in the above example changes a
correlation coefficient of virtually zero to one of unity.
The result, of course, is meaningless.

Fig. 74. Showing the Relation between Number of Wage-Earners in
Factories and Value of Products in Ten Selected Cities in the State of
New York. (Vertical scale: millions of dollars; horizontal scale:
thousands of wage earners.)
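
The effect of the single exceptional case may be verified directly. The sketch below is an illustration only, written in Python with names of our own choosing; it takes the figures of Table 93 as they are read here, the number of wage-earners for New York City being assumed to be 638.8 thousands.

```python
from math import sqrt


def correlation_summary(xs, ys):
    """Return (sigma_y, S_y, r) for paired observations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)
    sigma_y = sqrt(syy / n)
    s_y = sigma_y * sqrt(1 - r * r)  # scatter about the fitted line
    return sigma_y, s_y, r


# Ten small cities of Table 93: wage-earners (thousands), products ($ millions).
wage_earners = [2.2, 2.2, 3.5, 2.5, 2.8, 1.7, 2.2, 2.1, 1.4, 1.8]
products = [9, 10, 11, 10, 12, 10, 10, 11, 10, 16]

ten = correlation_summary(wage_earners, products)

# Add New York City (X assumed to read 638.8; Y = 5,261).
eleven = correlation_summary(wage_earners + [638.8], products + [5261])

print("ten cities:    sigma_y=%.4f  S_y=%.4f  r=%.3f" % ten)
print("eleven cities: sigma_y=%.1f  S_y=%.2f  r=%.6f" % eleven)
```

For the ten cities the standard deviation and the coefficient agree with the values given in the text; with the eleventh city added the coefficient is carried to within a very small fraction of unity.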

While this example represents an extreme instance, the 
same distortion will be felt, to a greater or less degree, 
whenever there is a departure from a normal distribution. 

Fig. 75. Showing the Relation between Number of Wage-Earners in
Factories and Value of Products in Eleven Selected Cities in the State of
New York. (Vertical scale: millions of dollars; horizontal scale:
thousands of wage earners.)

In practice the various measures of relationship cannot 
be restricted to perfectly normal distributions; but they 
must be interpreted with care when there is reason to believe 
that such disturbing influences are present. 

The Coefficient of Rank Correlation

The coefficient of rank correlation is a measure of
relationship not subject to the distortion introduced by material
departures from normality, and one which is particularly
useful in providing an objective test of the existence of
correlation. Its application calls merely for the orderly
ranking of observations. Thus we may rank 47 states of
the union according to the number of individual income
tax returns in 1934, and according to the number of
passenger automobiles registered in that year. The results are
shown in Table 94.


                              Table 94

 Illustrating the Computation of the Coefficient of Rank Correlation

     (1)               (2)                (3)              (4)       (5)
                 Rank on basis of   Rank on basis of
                 number of indi-    number of pas-     Difference
    State        vidual income      senger automo-     (2) - (3)
                 tax returns in     biles registered       d          d²
                 1934               in 1934

Nevada                  1                  1               0          0
Wyoming                 2                  3             - 1          1
New Mexico              3                  4             - 1          1
S. Dakota               4                 15             -11        121
Idaho                   5                  9             - 4         16
N. Dakota               6                 12             - 6         36
Vermont                 7                  5             + 2          4
Delaware                8                  2             + 6         36
Arizona                 9                  6             + 3          9
Utah                   10                  7             + 3          9
Mississippi            11                 13             - 2          4
Arkansas               12                 16             - 4         16
S. Carolina            13                 18             - 5         25
New Hampshire          14                  8             + 6         36
Montana                15                 10             + 5         25
Maine                  16                 14             + 2          4
Alabama                17                 19             - 2          4
Nebraska               18                 30             -12        144
Oregon                 19                 21             - 2          4
W. Virginia            20                 17             + 3          9
Colorado               21                 22             - 1          1
Rhode Island           22                 11             +11        121
N. Carolina            23                 31             - 8         64
Florida                24                 23             + 1          1
Kentucky               25                 25               0          0
Kansas                 26                 33             - 7         49
Louisiana              27                 20             + 7         49
Tennessee              28                 26             + 2          4
Georgia                29                 29               0          0
Oklahoma               30                 32             - 2          4
Virginia               31                 28             + 3          9
Iowa                   32                 35             - 3          9
Minnesota              33                 36             - 3          9
Indiana                34                 38             - 4         16
Maryland               35                 24             +11        121
Connecticut            36                 27             + 9         81
Wisconsin              37                 34             + 3          9
Missouri               38                 37             + 1          1
Texas                  39                 42             - 3          9
Michigan               40                 41             - 1          1
Ohio                   41                 44             - 3          9
New Jersey             42                 40             + 2          4
Massachusetts          43                 39             + 4         16
Illinois               44                 43             + 1          1
California             45                 46             - 1          1
Pennsylvania           46                 45             + 1          1
New York               47                 47               0          0

Total                                                              1,094

¹ Washington is excluded, because the published income tax returns for that
state include those of Alaska.

Following are the records for the nine states not listed in Table 86.

                    No. of taxable personal    No. of passenger auto-
     State          incomes (in thou-          mobiles registered (in
                    sands) 1934                thousands) 1934

California                  316                      1,769
Illinois                    310                      1,282
Massachusetts               243                        687
Michigan                    139                      1,026
New Jersey                  211                        741
New York                    808                      1,971
Ohio                        210                      1,453
Pennsylvania                342                      1,466
Texas                       119                      1,086

The degree of correlation is indicated by the degree of
concordance between the two rankings. A precise measure
of correlation is provided by the coefficient

$$\rho_r = 1 - \frac{6\Sigma d^2}{n^3 - n}$$

where d is a difference between the rankings of a given
state in columns (2) and (3), and n is the number of states
included. (The Greek letter rho ($\rho$) with subscript r is used
as the symbol of this coefficient.)

The method of computation is shown in Table 94. From
the measurements there we have

$$\rho_r = 1 - \frac{6 \times 1{,}094}{(47)^3 - 47} = 1 - \frac{6{,}564}{103{,}776} = .94.$$
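
The computation just given may be repeated mechanically. The following Python sketch is offered as an illustration only; the function name is ours, and the ranks are those of columns (2) and (3) of Table 94 as read here, the states being taken in the order of column (2).

```python
def rank_correlation(rank_a, rank_b):
    """Return (sum of d^2, rho_r) where rho_r = 1 - 6*sum(d^2)/(n^3 - n)."""
    n = len(rank_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return sum_d2, 1 - 6 * sum_d2 / (n ** 3 - n)


# Column (2): ranks 1 to 47 on the basis of income tax returns.
income_rank = list(range(1, 48))

# Column (3): ranks on the basis of passenger automobiles registered.
auto_rank = [
    1, 3, 4, 15, 9, 12, 5, 2, 6, 7, 13, 16,
    18, 8, 10, 14, 19, 30, 21, 17, 22, 11, 31, 23,
    25, 33, 20, 26, 29, 32, 28, 35, 36, 38, 24, 27,
    34, 37, 42, 41, 44, 40, 39, 43, 46, 45, 47,
]

sum_d2, rho = rank_correlation(income_rank, auto_rank)
print(sum_d2, round(rho, 2))
```

The formula in this form assumes that no ranks are tied, as is the case in Table 94.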


¹ The sum of the squares of the deviations of the first n natural numbers
from their mean is equal to

$$\frac{n^3 - n}{12}.$$

If we let d equal the difference between the rank of one variable and the
corresponding rank of the other, we have, for any given pair of values,

$$d = X - Y = x - y$$

(since the means of the two series of ranks are identical), so that

$$\Sigma d^2 = \Sigma(x - y)^2 = \Sigma x^2 + \Sigma y^2 - 2\Sigma xy$$

$$2\Sigma xy = \Sigma x^2 + \Sigma y^2 - \Sigma d^2.$$

But

$$\Sigma x^2 = \Sigma y^2 = \frac{n^3 - n}{12}.$$

Therefore

$$2\Sigma xy = \frac{2n^3 - 2n}{12} - \Sigma d^2 \quad\text{and}\quad \Sigma xy = \frac{n^3 - n}{12} - \frac{\Sigma d^2}{2}.$$

But

$$\rho_r = \frac{\Sigma xy}{\sqrt{\Sigma x^2\,\Sigma y^2}}$$

(the product-moment formula for r)

$$= \frac{\dfrac{n^3 - n}{12} - \dfrac{\Sigma d^2}{2}}{\dfrac{n^3 - n}{12}} = 1 - \frac{6\Sigma d^2}{n^3 - n}.$$

The coefficient of rank correlation is appropriate where
it is desired to rank individuals, or other entities, on the
basis of abilities or qualities not open to exact measurement.
It is also well adapted for use where the distributions of
the observations depart widely from the normal type and
the usefulness of customary measurements would
be seriously impaired. This point takes on particular
importance in connection with tests of significance, involving

CHAPTER XI


THE MEASUREMENT OF RELATIONSHIP
BETWEEN TIME SERIES

The methods of measuring correlation described in the 
preceding chapter were devised originally for the analysis 
of non-historical data, that is, for the treatment of frequency 
series rather than time series. The measurement of corre- 
lation between series in time presents certain distinctive 
problems which require separate treatment. 

We have seen that such series are affected by various 
forces, which have been classified as the secular trend, 
cyclical and seasonal fluctuations and accidental variations, 
and methods have been described by means of which the 
effects of these various forces may be isolated. This breaking 
up of a series into its component parts for separate study 
is essential in attempting to correlate series in time, for 
spurious and quite misleading results will be secured if 
this is not done. The problem of correlation is that of 
securing a precise measure of the degree of relationship 
between variable quantities. But each series in time repre- 
sents the combination of a number of variables and, so 
far as possible, each should be treated separately in corre- 
lating such series. 

The relationship between two time series as, for example, 
interest rates and bond prices, may be studied with respect 
to any or all of the following components: 

a. Secular trend. 

b. Cyclical fluctuations.

c. Seasonal fluctuations. 

d. Changes from one time unit to the next (e.g., week to week, 
month to month, or year to year). 

Such relationships may be studied, first, through the
comparison of graphs, and much may be learned by this 
simple process. The similarity or dissimilarity of secular 
trends, and the general relation between cyclical movements 
may be determined by a study of such graphs. For more 
accurate comparison the coefficient of correlation may be 
used, but when it is so employed it is particularly impor- 
tant that the precise nature of its employment and the 
exact significance of the results be understood. 

For the comparison of secular trends the coefficient of 
correlation would never be employed. The mere fact that 
two series have the same secular trend is no indication 
of a relationship of interdependence; a coefficient of correla- 
tion based upon the trend values would be meaningless. 
Moreover, much simpler methods are available for comparing 
trends. 

For the same reason a coefficient of correlation should 
not be based upon the original absolute values of two 
series in time, except in the rather rare case in which neither 
series is marked by a definite secular trend. The computa- 
tion of r, when dealing with ordinary statistical data, 
involves measuring the deviations of all the items from 
their respective arithmetic means, and securing the sum 
of the products of the paired deviations. When deviations
of like sign are paired throughout r will have a positive
value; when deviations of unlike signs are paired throughout
r will have a high negative value. The presence of
pronounced rising or declining secular trends makes it impossible
ble to secure significant values for r by the employment of 
this method. For example, the relation between automobile 
production and the price of bacon between the years 1900 
and 1920 might be measured. The secular trend is markedly 
rising in each case. When the deviations of the annual 
figures are measured from the arithmetic means of the 
two series, the paired items for the earlier years will be 
negative, for the later years positive. A fairly high positive
value for r would be secured, were the computation carried
through on this basis. This value would be quite misleading, 
for no real relationship can be expected in this case. The 
coefficient of correlation in such a case would measure, 
primarily, the relation between the two secular trends. 
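
The point may be illustrated with artificial figures. In the Python sketch below, which is not drawn from the text, two hypothetical series are built from rising straight-line trends to which unrelated oscillations are added; the trend here is a straight line fitted by least squares, and all names are illustrative. The coefficient is computed first from the original values and then from the deviations from trend.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment coefficient of correlation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def deviations_from_linear_trend(values):
    """Deviations of each item from a least-squares straight-line trend."""
    n = len(values)
    times = list(range(n))
    mean_t, mean_v = sum(times) / n, sum(values) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
             / sum((t - mean_t) ** 2 for t in times))
    return [v - (mean_v + slope * (t - mean_t)) for t, v in zip(times, values)]


years = range(20)
# Hypothetical series: rising trend plus an oscillation of period two.
series_a = [100 + 5 * t + 3 * (1 if t % 2 == 0 else -1) for t in years]
# Hypothetical series: rising trend plus an unrelated oscillation of period four.
series_b = [20 + 2 * t + 2 * (1 if t % 4 < 2 else -1) for t in years]

r_raw = pearson_r(series_a, series_b)
r_dev = pearson_r(deviations_from_linear_trend(series_a),
                  deviations_from_linear_trend(series_b))
print(round(r_raw, 3), round(r_dev, 3))
```

The first coefficient is high merely because both series rise; the second, based upon deviations from trend, is close to zero, the oscillations having no connection with one another.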

This coefficient might conceivably be employed to deter- 
mine the similarity between seasonal fluctuations in two 
series, but its utility for this purpose may be questioned. 
Here again other and simpler methods are available.

In practice, therefore, the device of correlation should 
be employed neither to measure the relation between secular 
trends nor between seasonal movements. Its use is confined 
to comparisons of two or more series with respect to cyclical 
fluctuations and with respect to the short time changes 
from month to month or year to year. And, if valid measures 
of correlation are to be secured in making such comparisons, 
the effects of forces which distort these comparisons should 
be eliminated, in so far as this is possible. The actual work 
of correlation must be preceded by a sifting process designed 
to remove such irrelevant material. Unless the data are 
thus “distilled” the interpretation of the resulting coeffi- 
cients will be difficult. 

The Measurement of Correlation between Cyclical
Fluctuations

In an earlier chapter we have dealt with methods by 
which the effects of certain of the factors affecting time 
series might be measured and eliminated. The spurious 
correlation due to secular trend may be avoided by measur- 
ing the deviations of the observations not from the respec- 
tive arithmetic averages but from the lines of secular trend 
of the two series. These variations, the deviations from 
trend, are the significant values if our interest centers in the 
cycles. If annual values are employed the problem of elim- 
inating seasonal fluctuations is not faced. 

To illustrate this method of measuring the relationship
between series in time we may undertake to determine
whether there is any connection between cyclical fluctuations
in cotton production and in cotton prices. Figures for 
crop years are to be employed, for the period 1901-02 to 
1935-36. 

Cotton prices require some correction before correlation
is attempted. The raw figures with which the investigation
starts are average spot prices at New York for middling
upland cotton, at wholesale, from September to May of
each crop year. But such prices reflect not only the effects
of varying conditions in the cotton market, but also changes
in the general level of prices. To eliminate the effect of
this factor the original prices are deflated by Bradstreet’s
price index, as computed for the September-May period in
each crop year. For this purpose, Bradstreet’s index has
been reduced to relative terms, with the average for the
crop year 1913-14 equal to 100. The original figures for
the two series to be correlated, together with the corrected
price figures, are given in Table 95.

Fig. 76. Cotton Production in the United States, Crop Years 1901-1902
to 1935-1936, with Line of Trend

                              Table 95

            Cotton Production and Cotton Prices, 1901-1936

   (1)           (2)               (3)               (4)              (5)
             Cotton produc-    Cotton prices,    Bradstreet’s     Cotton prices,
   Crop      tion in United    spot, middling    price index,     deflated
   year      States, excluding cotton            average, Sept.   (in cents per
             linters (in thou- (in cents per     to May           pound)
             sands of bales)   pound)            (1913-14 = 100)

 1901-02          9,510             8.64              86.2            10.02
 1902-03         10,631             9.50              90.0            10.56
 1903-04          9,851            13.20              88.6            14.90
 1904-05         13,438             8.69              89.3             9.73
 1905-06         10,575            11.40              92.3            12.35
 1906-07         13,274            10.97              98.8            11.10
 1907-08         11,107            11.41              93.2            12.24
 1908-09         13,242             9.81              91.3            10.74
 1909-10         10,005            14.62             100.8            14.53
 1910-11         11,609            14.80              97.8            15.13
 1911-12         15,693            10.34             100.0            10.34
 1912-13         13,703            12.35             104.8            11.78
 1913-14         14,156            13.40             100.0            13.40
 1914-15         16,135             8.63             105.2             8.20
 1915-16         11,192            12.04             121.2             9.93
 1916-17         11,450            18.29             151.0            12.11
 1917-18         11,302            29.96             197.9            15.14
 1918-19         12,041            30.06             203.1            14.80
 1919-20         11,421            38.63             226.3            17.07
 1920-21         13,440            16.90             152.9            11.05
 1921-22          7,954            18.67             127.2            14.68
 1922-23          9,762            26.26             149.7            17.54
 1923-24         10,140            31.79             145.3            21.88
 1924-25         13,628            24.34             150.4            16.18
 1925-26         16,104            20.60             153.8            13.39
 1926-27         17,977            14.26             141.0            10.11
 1927-28         12,956            20.19             149.2            13.53
 1928-29         14,478            20.02             145.1            13.80
 1929-30         14,825            17.00             132.1            12.87
 1930-31         13,932            10.47             107.6             9.74
 1931-32         17,096             6.42              86.1             7.46
 1932-33         13,002             6.75              76.2             8.86
 1933-34         13,047            10.96             100.6            10.89
 1934-35          9,636            12.42             106.7            11.64
 1935-36         10,443            11.50             112.6            10.29
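
The process of deflation described above is a simple division. The Python sketch below is illustrative only (the function name is ours); it applies the operation to the first three crop years, using the prices and index numbers as read from Table 95.

```python
def deflate(price, index, base=100.0):
    """Express a price in terms of the base-period price level."""
    return price * base / index


# (crop year, spot price in cents per pound, Bradstreet's index, 1913-14 = 100)
rows = [
    ("1901-02", 8.64, 86.2),
    ("1902-03", 9.50, 90.0),
    ("1903-04", 13.20, 88.6),
]

deflated = [(year, round(deflate(price, index), 2)) for year, price, index in rows]
for year, value in deflated:
    print(year, value)
```

The results correspond to the entries in the last column of the table for these years.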

These data are plotted in Figs. 76 and 77. Lines of trend
fitted to the two series are shown on the charts.¹

Fig. 77. Cotton Prices, Crop Years 1901-1902 to 1935-1936, with Line
of Trend. (Figures relate to average annual prices, during crop years,
deflated by Bradstreet’s index of wholesale prices)

The deviation of each annual item from the secular trend
of the given series is now to be measured, and the coefficient
of correlation between these deviations is to be calculated.
The computations appear in Table 96.

                              Table 96

  Computation of Coefficient of Correlation, Cotton Production and
                           Cotton Prices

   (1)          (2)             (3)            (4)           (5)          (6)
            Deviation of    Deviation of
            cotton pro-     deflated cot-
   Crop     duction from    ton prices
   year     trend (in       from trend
            1,000’s of      (in cents
            bales)          per lb.)
                x               y              x²            y²           xy

 1901-02     - 1,395         - 1.32         1,946,025      1.7424     +  1,841.40
 1902-03     -   393         -  .72           154,449       .5184     +    282.96
 1903-04     - 1,298         + 3.63         1,684,804     13.1769     -  4,711.74
 1904-05     + 2,161         - 1.59         4,669,921      2.5281     -  3,435.99
 1905-06     -   834         +  .95           695,556       .9025     -    792.30
 1906-07     + 1,731         -  .42         2,996,361       .1764     -    727.02
 1907-08     -   572         +  .57           327,184       .3249     -    326.04
 1908-09     + 1,428         - 1.11         2,039,184      1.2321     -  1,585.08
 1909-10     - 1,945         + 2.49         3,783,025      6.2001     -  4,843.05
 1910-11     -   476         + 2.87           226,576      8.2369     -  1,366.12
 1911-12     + 3,476         - 2.14        12,082,576      4.5796     -  7,438.64
 1912-13     + 1,357         -  .93         1,841,449       .8649     -  1,262.01
 1913-14     + 1,648         +  .45         2,715,904       .2025     +    741.60
 1914-15     + 3,542         - 4.98        12,545,764     24.8004     - 17,639.16
 1915-16     - 1,516         - 3.47         2,298,256     12.0409     +  5,260.52
 1916-17     - 1,366         - 1.50         1,865,956      2.2500     +  2,049.00
 1917-18     - 1,615         + 1.35         2,608,225      1.8225     -  2,180.25
 1918-19     -   968         +  .84           937,024       .7056     -    813.12
 1919-20     - 1,671         + 2.97         2,792,241      8.8209     -  4,962.87
 1920-21     +   275         - 3.15            75,625      9.9225     -    866.25
 1921-22     - 5,273         +  .41        27,804,529       .1681     -  2,161.93
 1922-23     - 3,515         + 3.25        12,355,225     10.5625     - 11,423.75
 1923-24     - 3,174         + 7.62        10,074,276     58.0644     - 24,185.88
 1924-25     +   290         + 2.00            84,100      4.0000     +    580.00
 1925-26     + 2,758         -  .65         7,606,564       .4225     -  1,792.70
 1926-27     + 4,637         - 3.73        21,501,769     13.9129     - 17,296.01
 1927-28     -   360         -  .04           129,600       .0016     +     14.40
 1928-29     + 1,202         +  .58         1,444,804       .3364     +    697.16
 1929-30     + 1,608         +  .07         2,585,664       .0049     +    112.56
 1930-31     +   793         - 2.56           628,849      6.5536     -  2,030.08
 1931-32     + 4,055         - 4.24        16,443,025     17.9776     - 17,193.20
 1932-33     +    80         - 2.17             6,400      4.7089     -    173.60
 1933-34     +   265         +  .66            70,225       .4356     +    174.90
 1934-35     - 2,982         + 2.30         8,892,324      5.2900     -  6,858.60
 1935-36     - 1,988         + 1.94         3,952,144      3.7636     -  3,856.72

 Totals                                   171,865,603    227.2511    - 128,167.61

$$\sigma_x = \sqrt{\frac{171{,}865{,}603}{35}} = 2{,}216.0$$

$$\sigma_y = \sqrt{\frac{227.2511}{35}} = 2.548$$

$$r = \frac{\Sigma xy}{N\sigma_x\sigma_y} = \frac{-128{,}167.61}{35 \times 2{,}216.0 \times 2.548} = -.648.$$

This value of -.648 for the coefficient indicates a fair
degree of negative correlation between deviations of cotton
production in the United States from the line of trend and
the corresponding deviations of cotton prices in New York,
during the period covered.

¹ The equation to the line of trend of cotton production is

Y = 13,009.14 + 87.96X - 4.640X² - .149X³, with origin at 1918-19.

The trend equation for deflated cotton prices is

Y = 13.96 + .152X - .01425X² - .00083X³, with origin at 1918-19.
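
The manner in which a deviation from trend is obtained may be shown for a few crop years. The Python sketch below is an illustration only: it assumes the trend equation for deflated cotton prices to read Y = 13.96 + .152X - .01425X² - .00083X³, with X counted in years from the 1918-19 origin, and it takes the deflated prices from Table 95 as read here.

```python
def price_trend(x):
    """Trend value of deflated cotton prices; x = years from 1918-19."""
    return 13.96 + 0.152 * x - 0.01425 * x ** 2 - 0.00083 * x ** 3


# x (years from origin) -> deflated price in cents per pound
observed = {
    -17: 10.02,  # 1901-02
    -9: 14.53,   # 1909-10
    0: 14.80,    # 1918-19
    17: 10.29,   # 1935-36
}

deviations = {x: round(price - price_trend(x), 2) for x, price in observed.items()}
for x, dev in deviations.items():
    print(x, dev)
```

These deviations are the y-values entered against the corresponding crop years in Table 96.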

From the values already computed we may derive an
equation for estimating the variation in cotton price
associated with a given variation in production. This regression
equation, as we have seen, is of the type

$$y = r\frac{\sigma_y}{\sigma_x}\,x.$$

In the present case y and x refer to deviations from the
parabolic lines of trend. Substituting the given values, we
have

$$y = -.648\,\frac{2.548}{2{,}216}\,x$$

$$y = -.00074x.$$

This equation means that, on the average, a unit
deviation of cotton production (x) above the line of trend was
accompanied by a deviation of .00074 units in cotton prices
(y) below the line of trend. The unit employed in the
production figures was 1,000 bales, in the deflated price
figures, one cent. In the interpretation of the equation
it may be simpler to use an x-unit of one million bales,
making the equation of regression

$$y = -.74x.$$

Thus a cotton crop one million bales above trend was
accompanied by prices about three quarters of a cent per
pound below trend (with reference always to deflated
prices). This was the average relationship during the
period 1901-1936. It did not hold in all cases, as is indicated
by the fact that r has a value of but -.648. If this, or
a similar law, held perfectly, r would have a value of -1.
The value of $S_y$, which measures the scatter about the
line of regression, may be computed from the formula

$$S_y = \sigma_y\sqrt{1 - r^2}.$$

In the present case, $S_y$ has a value of 1.94 cents. This
measure is to be interpreted as explained in an earlier
section.
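
The several measures for the cotton example follow from the three totals of Table 96. The Python sketch below is offered only as a check upon the arithmetic; the variable names are ours, and the totals are entered as they are read from the table.

```python
from math import sqrt

N = 35                      # crop years 1901-02 to 1935-36
SUM_X2 = 171_865_603        # sum of squared production deviations
SUM_Y2 = 227.2511           # sum of squared price deviations
SUM_XY = -128_167.61        # sum of products of paired deviations

sigma_x = sqrt(SUM_X2 / N)              # thousands of bales
sigma_y = sqrt(SUM_Y2 / N)              # cents per pound
r = SUM_XY / (N * sigma_x * sigma_y)    # coefficient of correlation
slope = r * sigma_y / sigma_x           # cents per 1,000 bales
s_y = sigma_y * sqrt(1 - r * r)         # standard error of estimate

print(round(sigma_x, 1), round(sigma_y, 3))
print(round(r, 4), round(slope * 1000, 3), round(s_y, 2))
```

The slope is multiplied by 1,000 in the last line in order to express it in cents per million bales, the unit suggested above for purposes of interpretation.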

(It should be emphasized that the use of the [...] there is an
arbitrary element [...] not present in the correlation
problems previously studied. The deviations are measured
from lines of trend [...] and different lines [...] would
give quite different results.)

In the above example the lines of trend were both power
curves of the third degree. We might, perhaps with equal
reason, assume that the underlying trends are best defined
by other functions. Coefficients of regression and correlation
would have different values if this were done. The presence
of this arbitrary element in the correlation of deviations
from lines of secular trend detracts somewhat from the
confidence that may be placed in the results. The critical
problem here lies not in the mechanical process of
correlation, but in the choice of an appropriate line of trend for
each series. If, by the tests of inspection and of
correspondence with such external evidence as may be available,
it appears that the curve selected accurately represents the
trend in each of the series correlated, the coefficient may
be accepted as significant. But, in the interpretation and
use of the results, the presence of this element of personal
judgment in the preliminary calculations must not be
forgotten. This applies with particular force if the study
aims to establish a functional relationship between cyclical
fluctuations in the two series, and if an estimating (or
regression) equation is to be based upon the results.

The Coefficient of Correlation and the
Measurement of Time Sequence

In the correlation of cotton production and cotton prices
we sought to measure as accurately as possible the
effect of variations in cotton production upon cotton prices.
An equation was secured which described this relation
when deviations were measured from the particular lines
then employed. Cotton prices were considered to be a
function of cotton production, and the object of the study
was to measure this functional relationship. We seek,
in such cases, to determine the degree to which cycles
in one series depend upon or reflect cycles in a related
series, assuming some functional relationship between them.
This is essentially the problem described in introducing the
subject of correlation, and generally constitutes the major
problem in studying the relation between series of any type.

But a second and somewhat different problem may be
faced in certain studies of time series. Assuming that
two such series are marked by definite cycles, it is of
interest to determine whether the cycles coincide in time,
or whether cycles in one series consistently precede or lag
behind cycles in the other. The coefficient of correlation
has been found very useful in determining the degree of
“lead” or “lag” in such cases. This problem is that of
determining merely temporal relationship, as opposed to
the functional relationship that is ordinarily to be measured.
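
One way in which the coefficient may be so used is to pair each item of one series with the item of the other series falling a fixed number of periods later, and to repeat the computation for a succession of such intervals. The Python sketch below illustrates the idea upon hypothetical figures, not upon the data of the text: the second series is constructed to repeat the first two periods later, and the function names are ours.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment coefficient of correlation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def lagged_r(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag], for lag >= 0 periods."""
    n = len(leader) - lag
    return pearson_r(leader[:n], follower[lag:lag + n])


# Hypothetical cyclical deviations (in units of the standard deviation).
leader = [0.0, 1.2, 2.1, 1.5, -0.3, -1.8, -2.2, -0.9, 0.8, 2.4,
          1.9, 0.2, -1.1, -2.5, -1.4, 0.3, 1.7, 2.0, 0.6, -0.7]
# The follower repeats the leader two periods later.
follower = [0.0, 0.0] + leader[:-2]

by_lag = {lag: lagged_r(leader, follower, lag) for lag in range(0, 7)}
best_lag = max(by_lag, key=by_lag.get)
for lag, value in by_lag.items():
    print(lag, round(value, 3))
print("highest coefficient at a lag of", best_lag, "periods")
```

The interval giving the highest coefficient is taken as the measure of lead or lag; in this constructed case it is necessarily two periods.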

THE RELATION BETWEEN STOCK PRICE CYCLES AND CYCLES
OF BUSINESS ACTIVITY

To illustrate the solution of a problem of this latter type, 
we may undertake to determine the relation, in time, 
between cyclical movements in industrial stock prices and 
in general business activity, as measured by the composite 
index compiled by the American Telephone and Telegraph 
Company. The monthly values of this index for the period 
1899-1937 have been presented in an earlier section. Figures
relating to stock prices from January, 1903, to June, 1914, 
are given in Table 97. 

Table 97

Cycles in Industrial Stock Prices, 1903-1914 ¹

(Figures relate to deviations from trend in units of the standard deviation)

Month        1903   1904   1905   1906   1907   1908   1909   1910   1911   1912   1913   1914

January      - .2   -2.0   - .1   +2.3   +1.6   -1.3   + .6   +1.0   - .1   - .4   - .2   - .7
February     - .1   -2.1   + .2   +2.2   +1.4   -1.6   + .3   + .5   + .1   - .4   - .6   - .6
March        - .3   -2.1   + .6   +1.9   + .6   -1.1   + .3   + .8   - .1   - .1   - .7   - .6
April        - .5   -2.0   + .7   +1.7   + .6   - .8   + .5   + .6   - .1   + .3   - .6   - .?
May          - .5   -2.1   + .2   +1.4   + .4   - .5   + .8   + .4     0    + .2   - .7   - .?
June         - .9   -2.1   + .3   +1.5   + .2   - .6   + .9   + .1   + .1   + .2     ?      ?
July         -1.4   -1.8   + .7   +1.3   + .3   - .2   +1.1   - .4   + .1   + .2   - .9
August       -1.7   -1.6   + .?   +1.7   - .3   + .3   +1.4   - .3   ? .3   + .3   - .7
September    -1.9   -1.3   + .7   +1.7   - .5   + .1   +1.4   - .3   - .7   + .4   - .6
October      -2.3   - .9   + .8   +1.7   -1.3   + .2   +1.4     0    - .7   + .3   - .8
November     -2.4   - .3   +1.1   +1.7   -1.9   + .5   +1.4   + .1   - .5   + .2   - .9
December     -2.1   - .1   +1.8   +1.7   -1.6   + .5   +1.3   - .2   - .4     0    - .9

¹ These figures, the results of analyses by W. M. Persons, are from the
Review of Economic Statistics, published by the Harvard Committee on Eco-
nomic Research. They are based upon the average price of 12 industrial stocks.
The data of the two series are plotted in Fig. 78.¹ From
a comparison of the two curves in this chart it is clear
that there is some relation between the movements in
the two series, but such a comparison affords no basis
for a definite conclusion. Our object is to determine whether
the cycles in the two series are exactly synchronous and,
if they are not, to measure the average time interval by
which cycles in one series precede the cycles in another.
The significance of such studies in the analysis of the business
cycle is obvious.

Fig. 78. — Comparison of Cyclical Fluctuations in Industrial Stock Prices
and in General Business Activity, 1903-1914. (Separate scales for the index
of industrial stock prices and the index of general business.)

¹ The American Telephone and Telegraph index here plotted is not identical
with that given in Chapter IX. The latter is a revised series, differing in some
respects from the original index for the pre-war period that has been used in
the present calculations.

For the study of pre-war relations data for the period
from January, 1903, to June, 1914, may be employed.
A coefficient of correlation is first computed for concur-
rent items. A value of + .55 is secured. Next, the data
are correlated with industrial stock prices preceding general
business by one month. That is, the January, 1903, figure
for stock prices is multiplied by the February, 1903, index
of general business; the February stock price is multiplied
by the March business index, etc. This process is carried
through for the entire period from January, 1903, to June,
1914. Only 137 monthly values are used in this computa-
tion, as compared with 138 in the preceding case, for the
January, 1903, business index and the June, 1914, stock
price figure do not enter into the calculations. Accordingly,
the values c_x and c_y (the two corrections to be applied
because the origin does not coincide with the two averages)
and the two standard deviations will be slightly different.
These corrections may be readily made. The coefficient
of correlation secured from these computations has a value
of + .65. The same operation is repeated with other
pairings of the two variables. The results are summarized
below.

Table 98

Coefficients of Correlation between Industrial Stock Prices and an
Index of General Business Activity
(Based upon data for the period 1903-1914)

                                                     Coefficient of Correlation

Stock prices concurrent with business index                    + .55
Stock prices preceding business index by 1 month               + .65
These figures are plotted in Fig. 79.

The coefficients increase to a maximum value of + .76,
which is secured with stock prices preceding general business
by 4, 5, and 6 months. The stability of the coefficients
with the period of “lead” varying from 3 to 7 months
indicates that there was no one specific interval, within
the limits thus indicated, between the cyclical movements
of these two series. From the results here given it would
appear that five months was the average interval by which
stock prices preceded the general business index, but this
was not sharply marked off as a constant relationship.

Fig. 79. — Coefficients of Correlation between Index of Industrial Stock
Prices and Index of Business Activity, 1903-1914, Showing the Results
Secured with Different Pairings. (In all pairings except that of concurrent
items the business activity index follows the stock price index. Horizontal
axis: lag in months.)
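
The pairing procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two 138-month series are synthetic (the business series is built to repeat the stock series three months later), and the names `pearson_r`, `lagged_r` and `cycle` are hypothetical helpers, not part of the original study. Each coefficient is computed from the means and standard deviations of the paired items actually used, which is the adjustment the text calls the corrections c_x and c_y.

```python
import math

def pearson_r(xs, ys):
    """Product-moment coefficient using population (divide-by-N) moments."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

def lagged_r(leader, follower, lead_months):
    """Pair leader[t] with follower[t + lead_months]; return (r, pairs used)."""
    n_pairs = len(leader) - lead_months
    return pearson_r(leader[:n_pairs], follower[lead_months:]), n_pairs

def cycle(t):
    """Synthetic cyclical deviation for month t (an assumption, not real data)."""
    return math.sin(2 * math.pi * t / 40) + 0.3 * math.sin(2 * math.pi * t / 7)

# 138 hypothetical monthly items; business repeats stock three months later.
stock = [cycle(t) for t in range(138)]
business = [cycle(t - 3) for t in range(138)]

results = {k: lagged_r(stock, business, k) for k in range(0, 8)}
best_lead = max(results, key=lambda k: results[k][0])

for k, (r, n_pairs) in results.items():
    print(f"stock prices preceding by {k} months: r = {r:+.2f} ({n_pairs} pairs)")
print("largest coefficient at a lead of", best_lead, "months")
```

With a one-month lead the sketch uses 137 pairs out of 138 items, as in the text. On real data the coefficients would seldom peak as sharply as in this constructed case.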

With this record of pre-war relations we may contrast the
experience of recent years. The Index of Industrial Activity
of the American Telephone and Telegraph Company, given
in Chapter IX, defines the state of business. Of stock
price index numbers, the measurements currently published
in the Review of Economic Statistics ¹ are in a form best
adapted to our present needs, although a change in coverage
during the period detracts somewhat from their utility for
comparative purposes. Monthly values of this index for
the period 1919-1937 are recorded in Table 99. The two
series are plotted in Fig. 80.

¹ This is not a homogeneous series for the entire period covered. For the
years 1919-1924 the index is based on the average price of 20 industrial stocks
(the Dow-Jones index), expressed as deviations from trend in units of the
standard deviation. For the period 1925-1937 the official all-inclusive index
of the New York Stock Exchange (index No. 2) has been used. This index,

(Footnote 1 continued below)

Table 99

Cycles in Stock Prices, 1919-1937 *

Month     1919    1920    1921    1922    1923    1924    1925    1926    1927    1928

Jan.     + .38   +1.44   - .69   - .68   ? .12   - .47   + .15   +1.11   +1.04   +2.61
Feb.     + .38   + .78   - .70   - .53   + .05   - .43   + .17   + .90   +1.31   +2.31
March    + .70   +1.06   - .72   - .38   + .15   - .58   - .21   + .33   +1.25   +2.96
April    + .90   +?.10   - .68   - .18   - .02   - .81   - .12   + .52   +1.32   +3.37
May      +1.40   + .50   - .67   - .11   - .31   - .88   + .19   + .55   +1.56   +3.38
June     +1.76   + .46   -1.15   - .14   - .46   - .82   + .27   + .81   +1.39   +2.82
July     +2.01   + .38   -1.20   - .08   - .70   - .58   + .41   +1.00   +1.94   +2.89
Aug.     +1.60   + .04   -1.32   + .05   - .65   - .39   + .43   +1.11   +2.07   +3.37
Sept.    +1.78   + .11   -1.16   + .10   - .70   - .44   + .63   +1.13   +2.42   +3.63
Oct.     +2.14   - .04   -1.12   + .10   - .84   - .50   +1.01   + .89   +2.06   +3.69
Nov.     +1.90   - .44   - .89   - .16   - .72   - .25   + .90   +1.07   +2.55   +4.39
Dec.     +1.54   - .85   - .70   - .10   - .60   - .02   +1.05   +1.05   +2.67   +4.41

Month     1929    1930    1931    1932    1933    1934    1935    1936    1937

Jan.     +4.21   +1.11   -1.35   -4.02   -4.31   -2.83   -3.30   -1.61   - .64
Feb.     +4.11   +1.28   - .83   -3.90   -4.65   -2.90   -3.37   -1.51   - .61
March    +3.70   +1.83   -1.22   -4.20   -4.62   -2.89   -3.51   -1.61   - .65
April    +3.84   +1.59   -1.73   -4.64   -3.91   -2.93   -3.23   -1.92   -1.11
May      +3.17   +1.41   -2.35   -5.05   -3.33   -3.19   -3.14   -1.71   -1.17
June     +3.88   + .13   -1.85   -5.09   -2.90   -3.13   -2.97   -1.61   -1.44
July     +4.25   + .25   -2.15   -4.60   -3.26   -3.51   -2.71   -1.31   -1.03
Aug.     +4.89   + .24   -2.18   -3.86   -2.89   -3.37   -2.62   -1.27   -1.31
Sept.    +4.10   ? .51   -3.42   -3.97   -3.31   -3.40   -2.55   -1.23   -2.??
Oct.     +1.67   -1.04   -3.23   -4.30   -3.57   -3.45   -2.29   - .90   -2.48
Nov.     + .68   -1.21   -3.54   -4.42   -3.32   -3.21   -2.10   - .78   -2.85
Dec.     + .74   -1.65   -3.99   -4.37   -3.27   -3.21   -1.93   - .81   -3.03


Results obtained from a study of the temporal relations 
between these two series, for the period 1919-1937, are 
given in Table 100. 

(Footnote 1, continued)

originally constructed with the figure for Jan. 1, 1925 as 100, has here been
expressed in terms of deviations from 100, in units of a standard deviation
assumed to be equal to 15 on the original scale. In effect, a horizontal trend
at the level of Jan. 1, 1925, has been assumed for the Stock Exchange Index.
This index has also been shifted slightly in time. The index figure relating
to the first day of a given month, in the Stock Exchange tabulations, has here
been recorded as for the month preceding. Thus a February 1st index is
entered for January, a March 1st index for February, etc.

* From the Review of Economic Statistics. The figures in the table define
deviations from trend, in units of the standard deviation, with the assumptions
stated in the preceding footnote. The coefficients in Table 100 are based upon
data through July, 1937, only.
Table 100

Coefficients of Correlation between Stock Prices and an Index of
Business Activity

(Based upon data for the period 1919-1937)

                                                     Coefficient of Correlation

Stock prices concurrent with business index                    + .85
Stock prices preceding business index by  1 month              + .86
Stock prices preceding business index by  2 months             + .85
Stock prices preceding business index by  3 months             + .83
Stock prices preceding business index by  4 months             + .82
Stock prices preceding business index by  5 months             + .80
Stock prices preceding business index by  6 months             + .78
Stock prices preceding business index by  7 months             + .76
Stock prices preceding business index by  8 months             + .74
Stock prices preceding business index by  9 months             + .71
Stock prices preceding business index by 10 months             + .68
Stock prices preceding business index by 11 months             + .65

These measurements are shown graphically in Fig. 81.

In using these coefficients we should note that the stock 
price records for part of the recent period are different 
in important respects from those employed for the pre-war 
period. In place of the 12 industrial stocks entering into 
the earlier comparisons the index for the recent period 
included 20 stocks and, later, a comprehensive list composed 
of all varieties of stocks. The market behavior of the 
broader list may have departed somewhat from the pattern 
set by the limited number of industrial stocks. The differ- 
ence between the results for the two periods is to be inter- 
preted with this fact in mind. 

In post-war years the highest degree of correlation pre-
vailed with the business index following the stock price
index by one month. The traditional “lead” of stock
prices, on the basis of which the movements of these prices
have been used as forecasters of business changes, was
clearly reduced in this period. The actual statistical record
we have obtained may have been affected somewhat by
the broadening of the coverage of the stock price index
used, but the change in the relations between the two
series appears to have been a real one.

This method of measuring temporal relations between
economic series is highly useful, but one important caution
should be noted. The method indicates the average degree
of lead or lag of one series, with reference to another.
Frequently the sequences of change in economic series are
not the same in all phases of business cycles. Thus, observa-
tions relating to ten business cycles occurring between 1890
and 1925 indicate that pig iron prices preceded the general
index of wholesale prices by 3.4 months, on the average,
in business recessions, but followed the general index by
5.1 months, on the average, in periods of business revival.*
This highly important difference would be ironed out in
the measurement of average temporal relations by the
correlation method. The use of an average should be
supplemented by a study of the items entering into the
average. Study of the relations among individual observa-
tions at different cyclical phases is essential when correlation
technique is employed to define time sequences among the
movements of economic series.

* Cf. The Behavior of Prices, New York, National Bureau of Economic
Research, 1927, 84-87.

Fig. 81. — Coefficients of Correlation between Index of Industrial Stock
Prices and Index of Business Activity, 1919-1937, Showing the Results
Secured with Different Pairings. (In all pairings except that of concurrent
items the business activity index follows the stock price index. Horizontal
axis: lag in months.)

The Use of the Moving Average in Correlating 
Cycles in Time Series 

The preceding discussion has dealt only with cycles as 
measured from mathematically fitted lines of trend. But 
trend may be measured, as we have seen, by lines based 
upon moving averages, and the cyclical deviations from 
such lines may be correlated in precisely the same way as 
deviations from other lines of trend. The arithmetic 
mean of the deviations from such moving averages will 
not necessarily be zero, as in the case of deviations measured 
from lines fitted by the method of least squares, and a corre- 
sponding correction must be made in correlating such figures. 
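
The correction just mentioned may be illustrated with a short Python sketch. The two annual series below are invented for illustration, a centered five-year moving average is assumed as the measure of trend, and the function names are hypothetical. The deviations are correlated from sums taken about zero, with the corrections c_x and c_y applied because the deviations do not average zero.

```python
import math

def moving_average_deviations(series, period):
    """Deviations of each item from a centered moving average (odd period)."""
    half = period // 2
    deviations = []
    for i in range(half, len(series) - half):
        window = series[i - half : i + half + 1]
        deviations.append(series[i] - sum(window) / period)
    return deviations

def r_with_corrections(xs, ys):
    """r from sums about an arbitrary origin (zero), corrected by c_x and c_y."""
    n = len(xs)
    c_x = sum(xs) / n
    c_y = sum(ys) / n
    sd_x = math.sqrt(sum(x * x for x in xs) / n - c_x ** 2)
    sd_y = math.sqrt(sum(y * y for y in ys) / n - c_y ** 2)
    p = sum(x * y for x, y in zip(xs, ys)) / n - c_x * c_y
    return p / (sd_x * sd_y), c_x, c_y

# Hypothetical annual series, invented for illustration only.
series_a = [10, 12, 15, 13, 17, 21, 18, 22, 27, 24, 29, 35]
series_b = [50, 47, 43, 46, 41, 36, 40, 35, 30, 33, 29, 22]

dev_a = moving_average_deviations(series_a, 5)
dev_b = moving_average_deviations(series_b, 5)
r, c_a, c_b = r_with_corrections(dev_a, dev_b)

print("mean deviation, series A:", round(c_a, 3))
print("mean deviation, series B:", round(c_b, 3))
print("coefficient of correlation:", round(r, 3))
```

A five-year average leaves two years at each end of the series without a trend value, so twelve items yield only eight deviations. The choice of period is the arbitrary element discussed in the next paragraph.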

Moving averages are subject to the same criticism as 
are mathematical lines of trend. There can be no certainty 
that deviations from lines of trend based upon moving 
averages represent the effects of cyclical causes solely. The 
result in a given case depends upon the period of the moving 
average employed, and there is no perfect criterion by 
which to determine the best measure of trend. Significant 
and useful coefficients may be computed when deviations 
are measured from moving averages, but the presence of 
an arbitrary element in the work must be recognized and 
the results applied with corresponding reservations. 

The Correlation of Short Term Fluctuations 

In describing the variable factors that constitute compo-
nent elements of the values of a series in time, it was pointed
out that the coefficient of correlation would not generally
be employed in comparing either the secular trends or the
seasonal fluctuations of two series. It may be used to
advantage in measuring either functional or temporal rela-
tions between cyclical fluctuations, provided that the effects
of the other variables have been, so far as possible, elimi-
nated. The coefficient of correlation and the measures
which are employed in conjunction with it have a further
use in dealing with time series. They may be used to meas-
ure the relation between short term changes in two series:
changes from year to year, month to month, or even from
week to week or day to day, if desired. This problem is
distinct from that studied in the preceding section and in
the interpretation of the results the two should not be
confused.

There are several ways in which the problem of comparing
short term fluctuations may be attacked. The absolute
differences between successive items in two series may be
correlated, or these differences may be expressed as per-
centages or ratios. Table 101 illustrates the procedure
employed in measuring the correlation between the absolute
fluctuations from year to year (first differences) of cotton
production and cotton prices. The original values from
which the items in columns (2) and (3) are derived are
given in Table 95.

The process of computing r is identical with that em-
ployed in preceding examples, when deviations were meas-
ured from an arbitrary origin. The arbitrary origin in
this case is zero, but corrections must be made in the
various values since the algebraic sum of the given figures
is not zero in either case. Computations based on the
figures in Table 101 follow:
Table 101

Computation of Coefficient of Correlation between Cotton Production
and Cotton Prices

(Based upon first differences)

(1) Crop year
(2) X: Difference between production in given year and production in
    preceding year (in millions of bales)
(3) Y: Difference between price in given year and price in preceding
    year (in cents per pound, deflated)
(4) X²   (5) Y²   (6) XY

  (1)         (2)       (3)         (4)          (5)         (6)
Crop year      X         Y          X²           Y²          XY

1902-03     +1.121    + .54      1.256641      .2916     +   .60534
1903-04     - .780    +4.34       .608400    18.8356     -  3.38520
1904-05     +3.587    -5.17     12.866569    26.7289     - 18.54470
1905-06     -2.863    +2.62      8.196769     6.8644     -  7.50106
1906-07     +2.699    -1.25      7.284601     1.5625     -  3.37375
1907-08     -2.167    +1.14      4.695889     1.2996     -  2.47038
1908-09     +2.135    -1.50      4.558225     2.2500     -  3.20250
1909-10     -3.237    +3.79     10.478169    14.3641     - 12.26823
1910-11     +1.604    + .60      2.572816      .3600     +   .96240
1911-12     +4.084    -4.79     16.679056    22.9441     - 19.56236
1912-13     -1.990    +1.44      3.960100     2.0736     -  2.86560
1913-14     + .453    +1.62       .205209     2.6244     +   .73386
1914-15     +1.979    -5.20      3.916441    27.0400     - 10.29080
1915-16     -4.943    +1.73     24.433249     2.9929     -  8.55139
1916-17     + .258    +2.18       .066564     4.7524     +   .56244
1917-18     - .148    +3.03       .021904     9.1809     -   .44844
1918-19     + .739    - .34       .546121      .1156     -   .25126
1919-20     - .620    +2.27       .384400     5.1529     -  1.40740
1920-21     +2.019    -6.02      4.076361    36.2404     - 12.15438
1921-22     -5.486    +3.63     30.096196    13.1769     - 19.91418
1922-23     +1.808    +2.86      3.268864     8.1796     +  5.17088
1923-24     + .378    +4.34       .142884    18.8356     +  1.64052
1924-25     +3.488    -5.70     12.166144    32.4900     - 19.88160
1925-26     +2.476    -2.79      6.130576     7.7841     -  6.90804
1926-27     +1.873    -3.28      3.508129    10.7584     -  6.14344
1927-28     -5.021    +3.42     25.210441    11.6964     - 17.17182
1928-29     +1.522    + .27      2.316484      .0729     +   .41091
1929-30     + .347    - .93       .120409      .8649     -   .32271
1930-31     - .893    -3.13       .797449     9.7969     +  2.79509
1931-32     +3.164    -2.28     10.010896     5.1984     -  7.21392
1932-33     -4.094    +1.39     16.760836     1.9321     -  5.69066
1933-34     + .045    +2.04       .002025     4.1616     +   .09180
1934-35     -3.411    + .75     11.634921      .5625     -  2.55825
1935-36     + .807    -1.35       .651249     1.8225     -  1.08946

Total       + .933    + .27    229.624987   313.0067     -180.19834
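\[
c_x = \frac{\Sigma X}{N} = \frac{+.933}{34} = +.02744, \qquad c_x^2 = .000753
\]

\[
c_y = \frac{\Sigma Y}{N} = \frac{+.27}{34} = +.00794, \qquad c_y^2 = .000063
\]

\[
\sigma_x = \sqrt{\frac{\Sigma X^2}{N} - c_x^2} = \sqrt{\frac{229.624987}{34} - .000753} = 2.599
\]

\[
\sigma_y = \sqrt{\frac{\Sigma Y^2}{N} - c_y^2} = \sqrt{\frac{313.0067}{34} - .000063} = 3.034
\]

\[
p = \frac{\Sigma(XY)}{N} - c_x c_y = \frac{-180.19834}{34} - (.02744 \times .00794) = -5.300168
\]

\[
r = \frac{p}{\sigma_x \sigma_y} = \frac{-5.300168}{2.599 \times 3.034} = -.672
\]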


The equation of regression and the value of S_y, computed
from the usual formulas, are

y = - .78x
S_y = 2.25 cents.
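
The computation may be checked with a short Python sketch that starts from the first differences of Table 101 and applies the same corrections. The variable names are illustrative, and the sign of the last price difference (-1.35) is an assumption inferred from the column total of + .27. The slope and standard error are computed here as p/σ_x² and σ_y·√(1 − r²), which is taken to be what the text means by the usual formulas.

```python
import math

# First differences from Table 101.
# X: change in production (millions of bales).
X = [1.121, -0.780, 3.587, -2.863, 2.699, -2.167, 2.135, -3.237, 1.604,
     4.084, -1.990, 0.453, 1.979, -4.943, 0.258, -0.148, 0.739, -0.620,
     2.019, -5.486, 1.808, 0.378, 3.488, 2.476, 1.873, -5.021, 1.522,
     0.347, -0.893, 3.164, -4.094, 0.045, -3.411, 0.807]
# Y: change in deflated price (cents per pound); last sign inferred from total.
Y = [0.54, 4.34, -5.17, 2.62, -1.25, 1.14, -1.50, 3.79, 0.60, -4.79,
     1.44, 1.62, -5.20, 1.73, 2.18, 3.03, -0.34, 2.27, -6.02, 3.63,
     2.86, 4.34, -5.70, -2.79, -3.28, 3.42, 0.27, -0.93, -3.13, -2.28,
     1.39, 2.04, 0.75, -1.35]

n = len(X)
c_x = sum(X) / n                      # correction for X (origin at zero)
c_y = sum(Y) / n                      # correction for Y
sigma_x = math.sqrt(sum(x * x for x in X) / n - c_x ** 2)
sigma_y = math.sqrt(sum(y * y for y in Y) / n - c_y ** 2)
p = sum(x * y for x, y in zip(X, Y)) / n - c_x * c_y
r = p / (sigma_x * sigma_y)

slope = p / sigma_x ** 2              # regression of y on x, in deviations
s_y = sigma_y * math.sqrt(1 - r ** 2) # standard error of estimate

print(f"N = {n}, c_x = {c_x:+.5f}, c_y = {c_y:+.5f}")
print(f"sigma_x = {sigma_x:.3f}, sigma_y = {sigma_y:.3f}")
print(f"p = {p:.6f}, r = {r:+.3f}")
print(f"y = {slope:+.2f}x, S_y = {s_y:.2f} cents")
```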

A comparison of the different results secured in the 
preceding examples relating to cotton throws some inter- 
esting light upon the general problem of correlation. In 
fact, in the two examples, we have measured the correlation 
between measurements that are not strictly comparable — 
deviations from third degree parabolas, in the first case, 
and year-to-year fluctuations in the production and price 
of cotton, in the second. Yet, if we were seeking to estimate 
the price of cotton which would accompany a given crop, 
an estimate might be based upon either of the studies, 
the results of which are given below. 

                                                          r        S_y

Correlation of cycles in cotton production and
prices (deviations measured from third degree
parabolas)                                             - .648    1.94 cents

Correlation of year-to-year fluctuations, same data    - .672    2.25 cents

The value of r in the second example is slightly greater 
than the value secured in the first case, though the standard 
error is also larger. The reason for this apparent contradic- 
tion has been suggested above; the standard deviation of 
the year-to-year fluctuations in cotton prices is greater
than the standard deviation about the trend of cotton prices. 

It appears that errors of estimate are less when based 
upon the results secured when deviations from third degree 
curves are correlated than when based upon the study of 
year-to-year movements. But there is a concealed assump- 
tion in the first case, the assumption that the lines of trend 
of both prices and production may be projected beyond the 
period studied. There is an immeasurable margin of error 
in this assumption, and the standard error of estimate, 
accordingly, does not give a true measure of the probabilities 
involved. No such assumption is involved in the measure 
based upon year-to-year fluctuations. 
CHAPTER XII


THE MEASUREMENT OF RELATIONSHIP: 
NON-LINEAR CORRELATION 

In the preceding chapters the discussion has been confined 
to cases in which the relationship between two variables 
may be described by a straight line. The coefficient of 
correlation, r, is a measure of the degree to which two 
variables approach a linear relationship and it is signifi- 
cant only when a straight line gives a good fit to the points 
representing the paired values of X and Y. 

In fitting curves to time series, as explained in an earlier 
section, it is found that in many cases the trend is non- 
linear, and that a curve of higher degree is needed. The 
same thing is true in the field of our present discussion. 
It is possible to have a high degree of correlation between 
two variables when a straight line does not describe the 
relationship. In such a case there would be considerable 
scatter about the straight line of best fit, and the value 
of r would be misleadingly low. If a curve representing 
the real relationship could be fitted, the scatter would 
be materially reduced and the true correlation could be 
measured. The figures presented in Table 102 illustrate 
such a case. These data are plotted in Fig. 82. 

Two different curves have been fitted to the points
plotted in this figure. One is a straight line having the
equation

Y = 5.038 + .0886X

in which Y represents yield, in tons per acre, and X repre-
sents depth of irrigation water applied, in inches. The
degree of relationship between the two variables, as de-
scribed by this line, is indicated by the coefficient of
correlation, r, which has a value of + .69.

Table 102

Alfalfa Yield and Irrigation
Summary of investigations at Davis, California ¹

(The measurements in the body of the table measure yields, in tons per acre,
in 44 experiments)

                     Inches of irrigation water applied
              0      12      18      24      30      36      48      60

            2.35    4.31    5.69    6.00    7.53    7.58    8.05    5.55
            2.75    4.78    6.46    6.89    7.97    8.22    8.45    7.25
            2.89    4.84    7.02    7.96    8.32    8.63    8.63   10.17
            3.85    5.83    8.02    8.32    9.43    9.33    8.83   10.70
            5.52    6.51            8.38    9.54    9.38    9.52
            5.94    7.52            9.96   11.06   12.48   10.62

Average
yield       3.88    5.63    6.80    7.92    8.98    9.27    9.02    8.42

Average yield, all 44 experiments: 7.48

An inspection of the figure shows clearly that the straight 
line does not give the best possible fit. It is certain, there- 
fore, that r does not furnish a valid measure of the degree 
of relationship between alfalfa yield and depth of irrigation 
water. 

Parabolic Relationship 

The other curve in Fig. 82 is a second degree parabola, 
fitted by the method of least squares. The equation to this 
curve is 

Y = 3.539 + .2527X - .002827X².

It is obvious that the effect of increasing irrigation upon 
alfalfa yield is described much more accurately by this 
latter curve than by the straight line. The most important 
result of these investigations was the determination of the 
point at which alfalfa yield began to fall off with increased 
applications of water, and the straight line fails to indicate 
any such decline. 

¹ This table is taken from “The Economical Irrigation of Alfalfa in the Sac-
ramento Valley” by S. H. Beckett and E. D. Robertson, Bull. No. 280,
Agricultural Experiment Station, Univ. of California, May, 1917.

As the equation of relationship, therefore, we should use
the parabolic rather than the linear form. The standard
error, S_y, which is a necessary accompanying measure, may
be calculated by measuring the deviation of each value
from the corresponding computed value, and determining
the root-mean-square of these deviations. This procedure
is illustrated in Table 103. The figures for normal yield
which are given in this table are computed from the parabolic
equation given above.

Fig. 82. — Scatter Diagram Showing the Relation between Alfalfa Yield
and Irrigation Water Applied, with Two Lines of Regression

Inserting the sum of the squared deviations, as given in
col. (5) of Table 103, in the formula

\[
S_y = \sqrt{\frac{\Sigma d^2}{N}}
\]

we have

\[
S_y = \sqrt{\frac{80.9945}{44}} = 1.36.
\]
Table 103

Comparison of Actual and Computed Alfalfa Yield

(1) Depth of irrigation water, X
(2) Actual yield, Y
(3) Normal yield, as computed from parabolic equation, Y_c
(4) Deviation of actual from normal, (2) - (3), d
(5) d²

  (1)       (2)       (3)       (4)        (5)
   X         Y        Y_c        d          d²

   0        3.85      3.54     + .31       .0961
   0        5.94      3.54     +2.40      5.7600
   0        5.52      3.54     +1.98      3.9204
   0        2.75      3.54     - .79       .6241
   0        2.89      3.54     - .65       .4225
   0        2.35      3.54     -1.19      1.4161
  12        4.78      6.16     -1.38      1.9044
  12        7.52      6.16     +1.36      1.8496
  12        6.51      6.16     + .35       .1225
  12        4.31      6.16     -1.85      3.4225
  12        5.83      6.16     - .33       .1089
  12        4.84      6.16     -1.32      1.7424
  18        7.02      7.17     - .15       .0225
  18        5.69      7.17     -1.48      2.1904
  18        8.02      7.17     + .85       .7225
  18        6.46      7.17     - .71       .5041
  24        6.00      7.98     -1.98      3.9204
  24        8.38      7.98     + .40       .1600
  24        8.32      7.98     + .34       .1156
  24        6.89      7.98     -1.09      1.1881
  24        9.96      7.98     +1.98      3.9204
  24        7.96      7.98     - .02       .0004
  30        7.53      8.58     -1.05      1.1025
  30        9.54      8.58     + .96       .9216
  30        9.43      8.58     + .85       .7225
  30        7.97      8.58     - .61       .3721
  30       11.06      8.58     +2.48      6.1504
  30        8.32      8.58     - .26       .0676
  36        7.58      8.97     -1.39      1.9321
  36        9.33      8.97     + .36       .1296
  36        9.38      8.97     + .41       .1681
  36        8.22      8.97     - .75       .5625
  36       12.48      8.97     +3.51     12.3201
  36        8.63      8.97     - .34       .1156
  48        8.45      9.16     - .71       .5041
  48        9.52      9.16     + .36       .1296
  48        8.63      9.16     - .53       .2809
  48        8.83      9.16     - .33       .1089
  48       10.62      9.16     +1.46      2.1316
  48        8.05      9.16     -1.11      1.2321
  60       10.17      8.52     +1.65      2.7225
  60        7.25      8.52     -1.27      1.6129
  60       10.70      8.52     +2.18      4.7524
  60        5.55      8.52     -2.97      8.8209

Total                                     80.9945


THE INDEX OF CORRELATION

We need now the third value, the abstract measure of
degree of relationship. In dealing with cases of linear
relationship in the preceding chapter we found that such
a measure, the coefficient of correlation, could be derived
from known values of S_y and σ_y. An analogous measure
may be derived in the same way in cases of non-linear
relationship, such as that found in the present problem.
Since the term coefficient of correlation and the symbol r
refer only to cases of linear regression, we may term this
general measure the index of correlation, and use the symbol
ρ (rho) to represent it.

As a general formula for the index of correlation we
have*

\[
\rho_{yx} = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}
\]

* With X dependent this formula becomes

\[
\rho_{xy} = \sqrt{1 - \frac{S_x^2}{\sigma_x^2}}
\]

The first of the two subscripts refers always to the dependent variable, the
second to the independent. It is essential that these be shown, for ρ would
not necessarily be the same with X dependent as with Y dependent. Such a
distinction is not necessary in the case of linear correlation, for r is the same
no matter which variable be dependent.
The value of S_y has been derived above. The value of
σ_y, computed by familiar methods, is found to be 2.27.
Substituting in the formula for ρ, we have

\[
\rho_{yx} = \sqrt{1 - \frac{(1.36)^2}{(2.27)^2}} = .80.
\]

This value is materially greater than that of the coefficient
of correlation for the same data. The value of r is + .69.
The difference is due to the fact that the second degree
parabola constitutes a much better fit to the data than
the straight line. The correlation is distinctly non-linear,
and r is an inappropriate measure of correlation.
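
The comparison of r and ρ for the alfalfa data may be reproduced with the following Python sketch. It fits a straight line and a second degree parabola to the 44 observations of Table 103 by solving the least-squares normal equations, then forms the index of correlation from the standard error and the standard deviation of yields. The helper names `solve`, `fit_polynomial` and `index_of_correlation` are hypothetical. Because the fit is carried to full precision, the constants may differ slightly in the later decimals from the rounded equation printed above.

```python
import math

# Alfalfa data from Table 103: depth of water (inches) -> yields (tons per acre).
yields_by_depth = {
    0: [3.85, 5.94, 5.52, 2.75, 2.89, 2.35],
    12: [4.78, 7.52, 6.51, 4.31, 5.83, 4.84],
    18: [7.02, 5.69, 8.02, 6.46],
    24: [6.00, 8.38, 8.32, 6.89, 9.96, 7.96],
    30: [7.53, 9.54, 9.43, 7.97, 11.06, 8.32],
    36: [7.58, 9.33, 9.38, 8.22, 12.48, 8.63],
    48: [8.45, 9.52, 8.63, 8.83, 10.62, 8.05],
    60: [10.17, 7.25, 10.70, 5.55],
}
xs = [float(depth) for depth, group in yields_by_depth.items() for _ in group]
ys = [y for group in yields_by_depth.values() for y in group]

def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(n):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [v - factor * w for v, w in zip(aug[k], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_polynomial(xs, ys, degree):
    """Least-squares constants [a, b, c, ...] from the normal equations."""
    size = degree + 1
    matrix = [[sum(x ** (i + j) for x in xs) for j in range(size)]
              for i in range(size)]
    rhs = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(size)]
    return solve(matrix, rhs)

def index_of_correlation(xs, ys, coeffs):
    """Return (rho, S_y, sigma_y) for the fitted curve, all with divisor N."""
    n = len(ys)
    mean_y = sum(ys) / n
    fitted = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    s_y_sq = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / n
    sigma_y_sq = sum((y - mean_y) ** 2 for y in ys) / n
    return (math.sqrt(1 - s_y_sq / sigma_y_sq),
            math.sqrt(s_y_sq), math.sqrt(sigma_y_sq))

line = fit_polynomial(xs, ys, 1)
parabola = fit_polynomial(xs, ys, 2)
rho_line, s_line, sigma_y = index_of_correlation(xs, ys, line)
rho_parabola, s_parabola, _ = index_of_correlation(xs, ys, parabola)

print("straight line constants:", [round(c, 4) for c in line])
print("parabola constants:     ", [round(c, 6) for c in parabola])
print(f"sigma_y = {sigma_y:.2f}")
print(f"straight line: S_y = {s_line:.2f}, index = {rho_line:.2f}")
print(f"parabola:      S_y = {s_parabola:.2f}, index = {rho_parabola:.2f}")
```

For the straight line the index coincides with the absolute value of r, so the two printed indexes correspond to the values of about .69 and .80 discussed in the text.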

THE MEANING OF THE INDEX OF CORRELATION

It is important that the significance and the limitations
of ρ be understood. Its value depends upon the relation
between the scatter about the fitted line and the scatter
about the arithmetic mean of the Y's. In the case of a
straight line, ρ and r are identical, r being a special case
of ρ. The limits of ρ are 0 and 1, a value of 0 indicating
that there is no relationship, or that if there is a relation-
ship between the two variables it cannot be described by the
particular equation employed. A value of 1 indicates that
the relationship, as described by the equation employed,
is a perfect one. For curves of higher degree no positive or
negative sign should be attached to ρ, for the relationship
might be positive over part of the range and negative over
other parts, as in the alfalfa example given above.

The index of correlation, ρ, has no significance unless the
type of curve to which it applies be named in each case.
The meaning of r in this respect is always clear, for it is
understood that it relates always to a straight line, but
confusion would arise in the case of ρ unless the type of
curve were specifically mentioned. The index of correlation
may be looked upon as a measure of the adequacy of a
curve of given type to describe the relationship between two
variables.

It is, of course, always possible to secure a curve which
will pass through any number of points if the constants
in the equation be equal to the number of points. In such
a case ρ would, of necessity, be equal to 1, but this value
would have no significance. In any employment of mathematical
functions there is this limit of absurdity, when the
number of constants is equal to the number of points and
ρ would merely reflect this absurdity. The ordinary principles
of curve fitting must be kept in mind in using such
an index as this. It must never be taken to have an absolute
significance, standing by itself. Its significance is always
relative, referring to the particular function employed.
This fact, which is true of every measure of correlation,
is frequently overlooked, and invalid and fallacious conclusions
reached as a result.
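
The limit of absurdity described above can be seen numerically. The following is a minimal sketch, not part of the original text: it assumes five hypothetical, deliberately patternless points and passes through them the unique polynomial with five constants (built here by Lagrange interpolation, a technique chosen only for convenience). The index of correlation then comes out as 1 although the data show no law of relationship.

```python
def lagrange_fit(xs, ys):
    """Return a function evaluating the unique polynomial of degree
    len(xs) - 1 that passes through every (x, y) pair."""
    def curve(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return curve


def index_of_correlation_sq(xs, ys, curve):
    """rho^2 = 1 - (scatter about the fitted curve) / (scatter about the mean)."""
    mean_y = sum(ys) / len(ys)
    ss_resid = sum((y - curve(x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_resid / ss_total


# Hypothetical observations with no evident relationship.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 9.0, 4.0, 7.0, 1.0]

# Five points, five constants: the curve passes through every point.
rho_sq = index_of_correlation_sq(xs, ys, lagrange_fit(xs, ys))
print(rho_sq)
```

The value printed reflects only the number of constants employed, which is why the type of curve must always be named alongside ρ.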


A SHORT METHOD OF COMPUTING THE INDEX OF CORRELATION

The standard error and the index of correlation were
computed by a rather laborious method in the above example,
in order that there might be no misunderstanding
of their precise meaning. The burden of calculation may
be materially reduced, however, by taking advantage of
the relationships which were disclosed in dealing with r.
For a curve of the potential series

$$Y = a + bX + cX^2 + dX^3 + \cdots$$

the formula for S_y is derived by a simple extension of that
employed in the case of the straight line. As a general
formula for a series of this type, we have

$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^2Y) - d\Sigma(X^3Y) - \cdots}{N}$$

Similarly, the formula for r may be extended to give a
general formula for ρ applicable to any equation of this
general type. This formula¹ is

$$\rho_{yx}^2 = \frac{a\Sigma(Y) + b\Sigma(XY) + c\Sigma(X^2Y) + d\Sigma(X^3Y) + \cdots - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

In the special case in which the origin is at the mean of
the Y's, Σ(y) = 0 and c_y = 0, and the formula reduces to

$$\rho_{yx}^2 = \frac{b\Sigma(Xy) + c\Sigma(X^2y) + d\Sigma(X^3y) + \cdots}{\Sigma(y^2)}$$

The characteristics of the formulas for S and ρ should
be noted. The only values required in securing these
measures are the constants in the equation which describes
the average relationship, certain values which have been
used in the process of fitting and, in addition, Σ(Y²) and
c_y². Thus, as direct by-products of the fitting process, we
have the values of S and ρ, the two measures which are
needed to supplement the regression equation in securing
a complete description of the relationship between the two
variables in question. The equation describes the average
relationship. The standard error, S, is a measure of the
reliability of estimates based upon this equation, and ρ is
an abstract index of the degree of relationship, in so far as
that relationship can be described by the particular curve
employed.

The application of these formulas may be illustrated
with reference to the problem of alfalfa yield. The following
values, derived from the data of Table 102 and from the
fitting process, are required for this purpose:

    a = 3.539                  Σ(X²Y) = 407,564.64
    b = .252652                c_y² = 55.9197
    c = -.002827               Σ(Y²) = 2,688.2268
    Σ(Y) = 329.03              N = 44
    Σ(XY) = 10,271.72


Substituting in the formula for the standard error for a
non-linear correlation, we have

$$S_y^2 = \frac{2{,}688.2268 - (3.539 \times 329.03) - (.252652 \times 10{,}271.72) - (-.002827 \times 407{,}564.64)}{44} = \frac{80.8043}{44} = 1.8365$$

$$S_y = 1.36.$$

¹ See Appendix A for discussion of the derivation of this formula.

The index of correlation, for a curve of this type, is
computed from the equation

$$\rho_{yx}^2 = \frac{a\Sigma(Y) + b\Sigma(XY) + c\Sigma(X^2Y) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

Substituting the appropriate values, we have

$$\rho_{yx}^2 = \frac{146.9557}{2{,}688.2268 - (44 \times 55.9197)} = .6452$$

$$\rho_{yx} = .80.$$
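
The short method lends itself to direct computation, since both measures use the same products of constants and sums. The sketch below is an illustration only, not part of the original text; the variable names are arbitrary, and the numerical values are those quoted above for the alfalfa example.

```python
import math

# Values quoted in the text for the alfalfa example (Table 102).
a, b, c = 3.539, 0.252652, -0.002827
sum_y, sum_xy, sum_x2y = 329.03, 10_271.72, 407_564.64
sum_y2, cy_sq, n = 2_688.2268, 55.9197, 44

# a*Sum(Y) + b*Sum(XY) + c*Sum(X^2 Y): shared by both formulas.
explained = a * sum_y + b * sum_xy + c * sum_x2y

# Squared standard error: (Sum(Y^2) - explained) / N.
s_sq = (sum_y2 - explained) / n

# Squared index of correlation: (explained - N*c_y^2) / (Sum(Y^2) - N*c_y^2).
rho_sq = (explained - n * cy_sq) / (sum_y2 - n * cy_sq)

s_y = math.sqrt(s_sq)
rho = math.sqrt(rho_sq)
print(f"S_y^2 = {s_sq:.4f}, S_y = {s_y:.2f}")
print(f"rho^2 = {rho_sq:.4f}, rho = {rho:.2f}")
```

Because the coefficients are carried to six decimal places in the text, the results agree with the printed figures to the number of places shown there.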

The value of the index of correlation is influenced by
the relation between the number of observations and the
number of constants in the equation of relationship. When
the two are equal ρ will have a value of 1. In any case the
observed index of correlation tends to exceed the true index.
When the number of observations is not large it is advisable
to apply a correction for this bias. If we use ρ̄ to represent
the corrected value and m to represent the number of
constants in the equation of relationship, we may apply
a correction in terms of the relation¹

$$\bar{\rho}^2 = 1 - \left\{(1 - \rho^2)\left(\frac{N - 1}{N - m}\right)\right\}$$

Inserting the values given in the above example, we have

$$\bar{\rho}_{yx}^2 = 1 - \left\{(1 - .6452)\left(\frac{44 - 1}{44 - 3}\right)\right\} = .6279$$

$$\bar{\rho}_{yx} = .79.$$

If, in the application of this test, the value in brackets { }
exceeds unity, the value of ρ̄ is taken as 0.²

¹ From Mordecai Ezekiel, Methods of Correlation Analysis, New York,
Wiley, 1930, 121.
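
The correction, including the rule for a bracketed value exceeding unity, may be sketched as a small function. This is an illustration under stated assumptions rather than part of the original text: the function name is hypothetical, and the sketch assumes that the number of observations exceeds the number of constants, since otherwise the divisor N - m vanishes.

```python
def corrected_index_sq(rho_sq, n_obs, n_constants):
    """Squared index of correlation corrected for the number of constants.

    Assumes n_obs > n_constants. Returns 0.0 when the bracketed
    quantity exceeds unity, following the rule stated in the text.
    """
    bracket = (1 - rho_sq) * (n_obs - 1) / (n_obs - n_constants)
    if bracket > 1:
        return 0.0
    return 1 - bracket


# Alfalfa example: rho^2 = .6452, N = 44, m = 3 constants.
alfalfa = corrected_index_sq(0.6452, 44, 3)

# Hypothetical weak relationship with few observations: bracket exceeds 1.
weak = corrected_index_sq(0.05, 10, 4)

print(round(alfalfa, 4), weak)
```

The same function serves for the correlation ratio when m is set equal to the number of columns (or rows) of the correlation table.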

These methods of deriving S and ρ are applicable over
a wide field by a simple adaptation of the formulas to the
particular equations that may be employed in given
instances. Further illustrations are given in Chapter XVII,
while this general method is explained in more detail in
Appendix A.


The Correlation Ratio


A third distinctive measure of correlation remains to
be described. This is the correlation ratio, devised by
Karl Pearson and represented by the symbol η (eta).
This measure may be looked upon as a special case of ρ,
but somewhat different methods are employed in its computation.

We have seen that in all cases the degree of relationship
between two variables, as described by a curve of a given
type, may be determined from the formula

$$\text{Measure of correlation} = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

The coefficient of correlation, r, is just such a measure,
when S_y represents the standard deviation about a straight
line. The index of correlation, ρ, is a general measure of
the same type. The correlation ratio is precisely the same
sort of measure, S_y in this case representing the standard
deviation about a line passing through the mean of every
column in the correlation table. We have, in effect, increased
the number of constants in the equation of the curve to
be fitted until the number is equal to the number of columns.
If the means of all the columns lie on a straight line, the
correlation ratio and the coefficient of correlation will be
equal. If the means of the columns do not lie on a straight
line, the correlation ratio will be greater than the coefficient
of correlation.

² A corresponding correction should be made in the standard error of estimate,
when derived from a small number of observations. In this case the
correction must raise the unadjusted measure. For this correction Ezekiel
gives

$$\bar{S}^2 = S^2\left(\frac{N}{N - m}\right)$$

where S̄ represents the corrected standard error of estimate.

No new principle is involved, therefore, in the concept of
the correlation ratio. It is employed when the regression
is non-linear. It measures the degree of relationship between
two variables, in so far as this relationship may be
described by a curve passing through the mean of every
column. If the relationship is perfect, if there is no scatter
about the curve fitted in this way, η will have a value of 1.
If there is no relationship, if the scatter about the curve
is as great as the dispersion about the mean of the Y's, η
will have a value of zero.

The formula generally employed in the computation of
the correlation ratio differs somewhat from that given above.
To represent the standard deviation about the line joining
the means of the columns, the symbol σ_ay is employed,
instead of S_y. Its meaning is precisely the same as that
of S_y, as employed above, except that σ_ay refers always
to a correlation table.

The formula may be written

$$\eta_{yx} = \sqrt{1 - \frac{\sigma_{ay}^2}{\sigma_y^2}}$$

When eta is written as above (η_yx) it refers to the regression
of Y on X (Y dependent). When it is written η_xy it
refers to the regression of X on Y (X dependent), and its
value depends upon the scatter about a line joining the
means of the rows. Unlike r, which has the same value
for both regressions, η_yx and η_xy will have different values
unless the regression be linear.

THE COMPUTATION OF THE CORRELATION RATIO

Table 104 shows the general relation between the amount
of nitrogen, in pounds per acre, used as fertilizer in certain
agricultural experiments, and the corresponding yield of
wheat, in bushels per acre.¹ The points are plotted in
Fig. 83.


Table 104

Correlation Table Showing the Relation between Wheat Yield per Acre
and Amount of Nitrogen Used as Fertilizer

                            X: Nitrogen applied, in pounds per acre
Y: Wheat yield,      0-     20-    40-    60-    80-   100-   120-   140-   160-           Mean of
bushels per acre    19.9   39.9   59.9   79.9   99.9  119.9  139.9  159.9  179.9   Total    rows

32-35.9                                     5     16     12      4      5      2      44   107.27
28-31.9                              1     20     21      8      4      1             55    88.91
24-27.9                             16     19                                         35    60.86
20-23.9                             13                                                13    50.00
16-19.9                     12                                                        12    30.00
12-15.9                      8                                                         8    30.00
 8-11.9               3      5                                                         8    22.50
 4- 7.9              10                                                               10    10.00
 0- 3.9               8                                                                8    10.00

Total                21     25      30     44     37     20      8      6      2     193

Mean of columns     5.05  15.12  24.40  28.73  31.73  32.40  32.00  33.33  34.00


¹ This table is based upon experiments described by E. Davenport (“Comparative
Agriculture” in Bailey, Cyclopedia of American Agriculture). The
actual figures used have been arbitrarily chosen for the purpose of the present
illustration, but Davenport's experiments have demonstrated the existence
of a law similar to the one here assumed.

For the computation of η_yx by the formula given above
we need the values of σ_y and σ_ay, the latter being the
root-mean-square deviation about the line joining the means
of the various columns. The former value may be obtained
readily by methods already familiar. It is possible to
compute the quantity σ_ay by the method first employed
in calculating S_y, that is, by measuring and squaring the
deviations of the individual points from the line of regression.
In the present case, however, the line describing
the relationship passes through the mean of each column,
hence these means may be used in place of the “normal”
values as computed from an equation of regression. In
computing σ_ay, therefore, the deviations of the individual
items from the means of the various columns are squared,
added, the mean determined and the square root extracted,
just as in the computation of the standard deviation.
Part of the procedure is illustrated in Table 105, using the
data in the first column of Table 104. This column contains
all items having X-values between 0 and 20. The mean
Y-value of the 21 items falling in this column is 5.05;
deviations are measured from this value.

Fig. 83. Scatter Diagram Showing the Relation between Wheat Yield
and Nitrogen Applied as Fertilizer, with Straight Line of Regression and
Line Joining the Means of the Columns


Table 105

Computation of the Squares of the Deviations about the Mean of an Array

Class-interval                          Deviation from
(wheat yield in                         mean of column
bu. per acre)                           (5.05)
                      m        f             d              d²           fd²

 8-11.9              10        3            4.95         24.5025       73.5075
 4- 7.9               6       10             .95           .9025        9.0250
 0- 3.9               2        8          - 3.05          9.3025       74.4200

Total                                                                 156.9525


The sum of the squared deviations is obtained for each
of the other columns in a similar fashion. The standard
deviation about the means of all the columns, σ_ay, is found
to have a value of 2.420. The value of σ_y is 9.188.

Substituting the given values in the formula

$$\eta_{yx}^2 = 1 - \frac{\sigma_{ay}^2}{\sigma_y^2}$$

we have

$$\eta_{yx}^2 = 1 - \frac{(2.42)^2}{(9.188)^2} = 1 - .0694 = .9306$$

$$\eta_{yx} = .965.$$

This is the value of the correlation ratio, measuring the
degree of scatter about a line running through the means
of the columns. Its significance is discussed below.

The method of calculation employed in the preceding
example may be materially shortened. Let σ_my represent
the standard deviation of the means of the various columns
about the arithmetic mean of all the Y's. In computing
this value the mean of each column is weighted by the
number of items in that column. It may be shown¹ that

$$\sigma_{ay}^2 = \sigma_y^2 - \sigma_{my}^2$$

Substituting for σ_ay² in the equation

$$\eta_{yx}^2 = 1 - \frac{\sigma_{ay}^2}{\sigma_y^2}$$

we secure

$$\eta_{yx}^2 = 1 - \frac{\sigma_y^2 - \sigma_{my}^2}{\sigma_y^2} = \frac{\sigma_{my}^2}{\sigma_y^2}, \qquad \eta_{yx} = \frac{\sigma_{my}}{\sigma_y}$$

Since σ_my may be much more easily determined than σ_ay
the value of η is generally computed from this formula.
The data of Table 104 may be used to exemplify the process.
Calculations appear in Table 106.

¹ The following proof is adapted from Yule.

Given a series with mean M made up of two component series with means M₁
and M₂. N, the total number of observations, is equal to N₁ + N₂, the sum of
the observations in the two component series. What is the relation between σ,
σ₁ and σ₂? If we let

$$M_1 - M = c_1, \qquad M_2 - M = c_2$$

then for the mean-square deviation of the observations in the first of the
two component series, measured from M as origin, we have

$$S_1^2 = \sigma_1^2 + c_1^2.$$

Similarly

$$S_2^2 = \sigma_2^2 + c_2^2.$$

But N₁S₁² is equal to the sum of the squares of the deviations, about M, of
the items in the first of the component series, and N₂S₂² is equal to the sum
of the squares of the deviations, about M, of the items in the second of the
two component series. Therefore

$$N\sigma^2 = N_1 S_1^2 + N_2 S_2^2. \qquad (1)$$

But

$$S_1^2 = \sigma_1^2 + c_1^2 \quad \text{and} \quad S_2^2 = \sigma_2^2 + c_2^2$$

therefore

$$N\sigma^2 = N_1(\sigma_1^2 + c_1^2) + N_2(\sigma_2^2 + c_2^2). \qquad (2)$$

In the present case we have the major series with mean represented by M_y,
and a number of component series (the items arranged by columns) with means
represented by m_y1, etc. Let S_ay represent the standard deviation of any column
of Y's about the mean of that column. Then we have a number of component
series, with standard deviations S_ay1, etc., and with means differing from the
mean of all the Y's by M_y - m_y1, etc. Substituting in equation (2), we have

$$N\sigma_y^2 = n_1\left[S_{ay_1}^2 + (M_y - m_{y_1})^2\right] + n_2\left[S_{ay_2}^2 + (M_y - m_{y_2})^2\right] + \cdots \qquad (3)$$

$$N\sigma_y^2 = \Sigma(n \cdot S_{ay}^2) + \Sigma\left[n(M_y - m_y)^2\right]. \qquad (4)$$

But

$$N\sigma_{ay}^2 = \Sigma(n \cdot S_{ay}^2)$$

for, in each column,

$$S_{ay}^2 = \frac{\Sigma(d^2)}{n}$$

since d represents a deviation from the mean of that column. For all columns,

$$\sigma_{ay}^2 = \frac{\Sigma(d^2)}{N} = \frac{\Sigma(n \cdot S_{ay}^2)}{N}.$$

Substituting in equation (4)

$$\sigma_y^2 = \sigma_{ay}^2 + \frac{\Sigma\left[n(M_y - m_y)^2\right]}{N}. \qquad (5)$$

By definition of the standard deviation of the means of the columns

$$\sigma_{my}^2 = \frac{\Sigma\left[n(M_y - m_y)^2\right]}{N}.$$

Therefore, from (5),

$$\sigma_y^2 = \sigma_{ay}^2 + \sigma_{my}^2.$$


Table 106

Illustrating the Computation of the Correlation Ratio

Type of array       Mean value     Deviation
(X-value of         of Y-items     from mean      Square of
items in column)    in column      of all Y's     deviation     Frequency
(pounds)            (bushels)      (25.005)
                       m_y             d              d²            f            fd²

   10                  5.05        - 19.955        398.202         21        8,362.242
   30                 15.12        -  9.885         97.713         25        2,442.825
   50                 24.40        -   .605           .366         30           10.980
   70                 28.73        +  3.725         13.876         44          610.544
   90                 31.73        +  6.725         45.226         37        1,673.362
  110                 32.40        +  7.395         54.686         20        1,093.720
  130                 32.00        +  6.995         48.930          8          391.440
  150                 33.33        +  8.325         69.306          6          415.836
  170                 34.00        +  8.995         80.910          2          161.820

Total                                                             193       15,162.769

$$\sigma_{my} = \sqrt{\frac{15{,}162.769}{193}} = 8.864.$$

Substituting the given values in the formula

$$\eta_{yx} = \frac{\sigma_{my}}{\sigma_y}$$

we have

$$\eta_{yx} = \frac{8.864}{9.188} = .965.$$

The process of computing the correlation ratio may be
briefly summarized:

1. Arrange the items in the form of a correlation table.

2. Find the arithmetic mean of all the Y-items in each column
(i.e., find the arithmetic mean of each Y-array of type X).

3. Compute the arithmetic mean of all the Y's.

4. Find the deviation of the mean of each column from the
mean of all the Y's. Square each of these deviations and
multiply by the number of items in the given column. Get the
sum of the squared deviations.

5. Divide this sum by the total number of items and extract the
square root of the result. This gives the value of σ_my.

6. Compute σ_y.

7. Divide σ_my by σ_y. The quotient is η_yx.
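
The steps above can be sketched in code. The following is an illustrative sketch, not part of the original text: it enters the frequencies of Table 104, uses the class mid-values in place of the individual observations, and computes η_yx both by the short method (σ_my / σ_y) and by the first method (scatter about the column means). Variable names are arbitrary. Small differences from the printed figures are to be expected, since the text rounds the column means before squaring.

```python
import math

# Frequencies from Table 104. Rows are wheat-yield classes (first row = 32-35.9),
# columns are nitrogen classes (first column = 0-19.9).
y_mid = [34, 30, 26, 22, 18, 14, 10, 6, 2]   # mid-values of the Y classes
freq = [
    [0, 0, 0, 5, 16, 12, 4, 5, 2],
    [0, 0, 1, 20, 21, 8, 4, 1, 0],
    [0, 0, 16, 19, 0, 0, 0, 0, 0],
    [0, 0, 13, 0, 0, 0, 0, 0, 0],
    [0, 12, 0, 0, 0, 0, 0, 0, 0],
    [0, 8, 0, 0, 0, 0, 0, 0, 0],
    [3, 5, 0, 0, 0, 0, 0, 0, 0],
    [10, 0, 0, 0, 0, 0, 0, 0, 0],
    [8, 0, 0, 0, 0, 0, 0, 0, 0],
]

n_total = sum(map(sum, freq))

# Step 3: arithmetic mean of all the Y's, and total scatter about it.
mean_y = sum(y * sum(row) for y, row in zip(y_mid, freq)) / n_total
ss_total = sum((y - mean_y) ** 2 * sum(row) for y, row in zip(y_mid, freq))

ss_between = 0.0   # Step 4: weighted squared deviations of the column means
ss_within = 0.0    # scatter of the items about their own column means
for col in zip(*freq):                      # iterate over columns of the table
    n_col = sum(col)
    col_mean = sum(y * f for y, f in zip(y_mid, col)) / n_col   # Step 2
    ss_between += n_col * (col_mean - mean_y) ** 2
    ss_within += sum(f * (y - col_mean) ** 2 for y, f in zip(y_mid, col))

sigma_y = math.sqrt(ss_total / n_total)     # Step 6
sigma_my = math.sqrt(ss_between / n_total)  # Step 5
sigma_ay = math.sqrt(ss_within / n_total)

eta_short = sigma_my / sigma_y                           # Step 7
eta_long = math.sqrt(1 - sigma_ay ** 2 / sigma_y ** 2)   # first method

print(f"sigma_y = {sigma_y:.3f}, sigma_my = {sigma_my:.3f}, sigma_ay = {sigma_ay:.3f}")
print(f"eta_yx = {eta_short:.3f} (short method), {eta_long:.3f} (first method)")
```

The agreement of the two values of η_yx is a numerical check on the relation σ_y² = σ_ay² + σ_my² established in the footnote.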

The value of the correlation ratio of X on Y may be
found by substituting the proper values in the
formula

$$\eta_{xy} = \frac{\sigma_{mx}}{\sigma_x}$$

The symbol σ_mx represents the standard deviation of the
means of the various rows about the mean of all the X's.
The value of the correlation ratio of X on Y depends upon
the amount of scatter (horizontally) about the line joining
the means of the rows, and will generally be different
from the correlation ratio of Y on X. In the present
case the value of η_xy is found to be .824. As the line of
relationship approaches the linear form the two correlation
ratios approach identity.

Like r, η can never exceed 1, this value being secured
when there is no dispersion about the line joining the
means of the columns (or rows). From the formula

$$\eta_{yx} = \frac{\sigma_{my}}{\sigma_y}$$

it is evident that the value of the correlation ratio is zero
when σ_my is zero. This is the case when the mean of each
column has the same value as the mean of all the Y's.
Such a condition is found when an increase or decrease in
the value of the X-variable brings no corresponding change
in the value of the Y-variable. This means that in each
column of the correlation table there is a distribution of
cases similar to the general distribution of Y's. When
this is true there is clearly no relation between the two
variables.

The correlation ratio, it should be noted, never has a
negative value. It is possible to determine by inspection
of the correlation table, however, whether the relation
between two variables is direct, or inverse, or a varying one.

The coefficient of correlation has one distinct advantage,
as compared with the correlation ratio, in that when its
value and the values of the two standard deviations are
known the equations to the lines of regression may be
readily determined. This is not true of η. To get a quantitative
expression for the “law” of relationship between two
variables, when η has been computed, an additional calculation
for the purpose of fitting a curve to the means of the
arrays would be necessary.

CORRECTION OF THE CORRELATION RATIO

The use of η is only possible when the data are numerous,
and can be arranged in the form of a correlation table. If
a limited number of items should be so arranged, and it
chanced that there was but one item in each column, the
two measures σ_my and σ_y would be identical and η would
necessarily have a value of 1. Computed from a very small
number of cases and employing a large number of classes
the correlation ratio would be meaningless.

The raw correlation ratio may be corrected by the method
employed on a preceding page for the index of correlation
with m set equal to the number of groups (i.e., to the number
of columns, for η_yx; to the number of rows, for η_xy).
Thus, if η̄ be the corrected value, we have

$$\bar{\eta}^2 = 1 - \left\{(1 - \eta^2)\left(\frac{N - 1}{N - m}\right)\right\}$$

In the present instance

$$\bar{\eta}^2 = 1 - \left\{(1 - .9306)\left(\frac{193 - 1}{193 - 9}\right)\right\} = .9276$$

$$\bar{\eta} = .963.$$

The correction is very slight in the present case, but if
N were small or m very large it would reduce the given
value materially.

RELATION BETWEEN THE CORRELATION RATIO AND THE
COEFFICIENT OF CORRELATION

When the relation between two variables is absolutely
linear the line running through the means of the columns
corresponds, of course, to the line upon which the coefficient
of correlation is based. When this is the case η and r have
the same value. As the relationship between the two
variables departs from the linear form the values secured
for η and r differ, η being always greater than r. This
results from the fact that the scatter about a line joining
the means of the columns will always be less than the
scatter about a straight line fitted to these points, except
when the straight line passes through every mean point.
The less the scatter about the line expressing the
average relationship the greater the value of the measure
of correlation. Thus for the alfalfa problem it was found
that r has a value of +.69, and that an index of correlation
based upon a second degree parabola has a value of .80.
The correlation ratio for the same material is .82. For
the data of Table 104 the value of η (uncorrected) was
found to be .965; the value of r is +.793, the difference
between the two being marked. The reason for the difference
is found in Fig. 83, in which the straight line of regression
of Y on X and the line joining the means of the columns
are shown. The regression departs materially from linearity,
and the scatter about the straight line of regression is much
greater than the scatter about the line joining the means.

The relation between r and η affords a convenient test
of linearity in a given instance, since the two values will
be identical when the regression is strictly linear, and will
differ the more widely the greater the departure from the
linear form. The general test for linearity is

$$\zeta = \eta^2 - r^2.$$

Even in a case of linear regression it is probable that η
and r will differ somewhat because of fluctuations due to
chance alone. A material difference, as reflected in the
magnitude ζ (zeta), indicates that a straight line does not
describe the relationship in question and that r is not a
suitable measure of correlation. In the example given
above, in which η equals .965 and r equals .793, the
measure ζ has a value of .302. (The uncorrected η is used
in this test.) This is large enough to indicate that the regression
is non-linear.
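
As a check on the arithmetic, and assuming the test takes the form of the difference between the squares of the two measures (which agrees with the value quoted), the figures of the present example give

$$\zeta = \eta^2 - r^2 = (.965)^2 - (.793)^2 = .9312 - .6288 = .302.$$

When the means of the columns lie exactly on a straight line, η and r are equal and this difference vanishes.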

In later sections methods of testing for linearity are
more fully discussed.

CHAPTER XIII


ELEMENTARY PROBABILITIES AND THE 
NORMAL CURVE OF ERROR 

Reference has been made in an earlier section to the
family resemblance which is found among frequency distributions
drawn from widely different fields. Attention was
also drawn to a certain basic type, represented graphically
by the symmetrical bell-shaped curve, which is called
the “normal curve,” or the “normal curve of error.” In
an earlier day this curve was looked upon as representing
a fundamental law which described all distributions of
quantitative data. From the modern standpoint this was
quite an erroneous conception. The normal curve is viewed
today as but one of a number of types of curves which
may be used to describe frequency distributions. It is,
however, by far the most important type. For many of
the measurements used to describe distributions of observations
(measurements such as the mean, the standard deviation,
the coefficient of variation) are distributed in accordance
with this normal law of error. The procedures employed
in generalizing results obtained from the study of samples
and, in particular, in determining the reliability of such
generalizations, lean heavily upon this law. An understanding
of the characteristics of the normal curve is
essential to the statistician.

ELEMENTARY THEOREMS IN PROBABILITY

We may approach this subject by a brief consideration
of certain elementary principles of probability that enter
into many forms of statistical work. A detailed explanation
of the theory of probability would carry us beyond the
limits of the present volume. The treatment which follows
is presented only as an introduction to the subject, designed
to illustrate, by simple numerical examples, the relation
between the principles of probability and the normal law
of error.

In this argument we may use the following standard
notation. If an event can occur in n ways, a of which are
to be considered as successful and b as unsuccessful, the
probability p of a successful outcome may be written

$$p = \frac{a}{n}$$

and the probability q of an unsuccessful outcome may be
written

$$q = \frac{b}{n}$$

Since the sum of the favorable and unfavorable outcomes
is equal to the total number of events, we have

$$a + b = n.$$

Dividing by n,

$$\frac{a}{n} + \frac{b}{n} = 1$$

so that

$$p + q = 1.$$

A probability, therefore, may be written as a ratio. The
numerator of the fraction corresponding to this ratio represents
the number of favorable (or unfavorable) outcomes,
while the denominator represents the total number of
possible outcomes.


EXAMPLES OF SIMPLE PROBABILITIES

If we toss a coin, the turning up of a head being looked
upon as a favorable outcome, we have, as the probability
of a success,

$$p = \frac{1}{2}$$

and of a failure,

$$q = \frac{1}{2}.$$

If we roll a die, regarding a six spot as a favorable outcome,

$$p = \frac{1}{6}, \qquad q = \frac{5}{6}.$$

If a card be drawn from a pack of 52 the chance of drawing
the ace of spades is 1/52, of failing in that endeavor, 51/52.


THE ADDITION OF PROBABILITIES

What is the chance of securing either an ace of spades
or a two of spades in a single draw from a pack of 52 cards?
In such a case, where any one of several outcomes will be
considered as favorable, the probability of a success is
the sum of the separate probabilities. In this example

$$p = \frac{1}{52} + \frac{1}{52} = \frac{1}{26}.$$

The chance of drawing either a heart or a spade from a
pack of playing cards is given by

$$p = \frac{13}{52} + \frac{13}{52} = \frac{1}{2}.$$


THE MULTIPLICATION OF PROBABILITIES

Two events are said to be independent when the outcome
of one does not affect the outcome of the other. Thus the
result of one throw of a die does not, presumably, affect
the result of the next toss. The probability of a compound
event (i.e., that two events, independent of one another,
will both occur) is the product of the probabilities of the
separate events. Thus the chance of securing an ace,
followed by a two spot, in two successive throws of a die is

$$p = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}.$$

In computing the probability of a given outcome it is
frequently necessary both to multiply and to add probabilities.
For example, we wish to determine the chance of
securing the total 5 from two dice thrown simultaneously.
We may label the dice a and b to distinguish them. The
total 5 may be secured from any one of the following
combinations:

    Die a    Die b
      1        4
      2        3
      3        2
      4        1

The chance of securing an ace with die a is 1/6, of securing
a 4 with die b is 1/6. The chance of the two in combination
is 1/36. Similarly, the probability of each of the other
three combinations is 1/36. But any one of these four events
is to be considered successful, hence

$$p = \frac{1}{36} + \frac{1}{36} + \frac{1}{36} + \frac{1}{36} = \frac{4}{36} = \frac{1}{9}.$$

In this we have answered the question: What
is the probability of throwing a total of 5 in the toss of two
dice? We might wish to determine, instead, the chance of
throwing at least 5, any total of 5 or more being
considered a favorable outcome. Just as in the preceding
case, the probability of a success is the sum of the
probabilities of the separate outcomes considered
successful, so that we need the probability of each of
these totals:

    Probability of throwing 12 with two dice = 1/36
         "       "     "    11   "    "   "  = 2/36
         "       "     "    10   "    "   "  = 3/36
         "       "     "     9   "    "   "  = 4/36
         "       "     "     8   "    "   "  = 5/36
         "       "     "     7   "    "   "  = 6/36
         "       "     "     6   "    "   "  = 5/36
         "       "     "     5   "    "   "  = 4/36

    Sum of above probabilities               = 30/36

The chance of throwing at least 5 in the toss of two dice is,
therefore, 30/36 or 5/6.
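
The addition and multiplication of probabilities in the dice examples may be verified by enumerating the 36 equally likely outcomes. The sketch below is illustrative only and not part of the original text; the helper name `probability` is hypothetical, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (die a, die b) outcomes. By the multiplication
# rule each compound outcome has probability 1/6 x 1/6 = 1/36.
outcomes = list(product(range(1, 7), repeat=2))
p_each = Fraction(1, len(outcomes))


def probability(event):
    """Addition rule: sum the probabilities of all outcomes counted as successes."""
    return sum((p_each for a, b in outcomes if event(a, b)), Fraction(0))


p_ace_then_two = probability(lambda a, b: a == 1 and b == 2)
p_total_5 = probability(lambda a, b: a + b == 5)
p_at_least_5 = probability(lambda a, b: a + b >= 5)

# Probability of each separate total from 5 to 12, as in the list above.
by_total = {t: probability(lambda a, b, t=t: a + b == t) for t in range(5, 13)}

print(p_ace_then_two, p_total_5, p_at_least_5)
```

Summing the entries of `by_total` reproduces the chance of throwing at least 5, which is the addition rule applied to the eight separate totals.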

The Binomial Expansion and the Measurement of
Probabilities

It is possible to express these facts in a generalized form.
A simple illustration may be employed to exemplify the
derivation of the general expression.

If two coins are tossed simultaneously there are four
possible outcomes

    a b    a b    a b    a b
    T T    T H    H T    H H

(The two coins are represented, respectively, by the letters
a and b.) The chances of securing no heads, one head, and
two heads are, respectively, 1/4, 1/2, and 1/4. If three coins
(represented by the letters a, b, and c) are tossed simultaneously,
we have eight possible outcomes

    a b c   a b c   a b c   a b c   a b c   a b c   a b c   a b c
    T T T   T T H   T H H   T H T   H T T   H T H   H H T   H H H

The chances of securing no heads, 1 head, 2 heads, and
3 heads are, respectively, 1/8, 3/8, 3/8, 1/8.

But these results may be derived without working out
the separate probabilities in detail. We have employed
p and q to represent, respectively, the probability of success
and failure of a given event. If there are two independent
events the compound probabilities are given by the expansion
of the expression

$$(p + q)^2.$$

For the case in which p (e.g., the probability of throwing
a head) = q = 1/2, the probabilities of the various results
are given by

$$\left(\frac{1}{2} + \frac{1}{2}\right)^2 = \frac{1}{4} + \frac{1}{2} + \frac{1}{4}.$$

These are the results secured in the first example cited
in this section. If there are three independent events,
with p = q = 1/2, we have

$$\left(\frac{1}{2} + \frac{1}{2}\right)^3 = \frac{1}{8} + \frac{3}{8} + \frac{3}{8} + \frac{1}{8},$$

the probabilities secured in the second example.

If we wish to know not the separate probabilities but the
probable frequencies of the various outcomes in a given
number of trials, these may be computed from the expression

$$N(p + q)^n$$

where N represents the number of trials and n the number
of independent events. Thus if there are 200 trials and
there are two independent events, the probable frequencies
are given by

$$200(p + q)^2 = 200(p^2 + 2pq + q^2).$$

With p = q = 1/2 this gives us

$$200\left(\frac{1}{4}\right) + 200\left(\frac{1}{2}\right) + 200\left(\frac{1}{4}\right) = 50 + 100 + 50$$

which indicates the probable frequencies of 2 successes,
1 success, and no successes.

If there are three independent events, the probable
frequencies in N trials are determined from the binomial
expansion of

$$N(p + q)^3.$$

If N equals 200, we have

$$200(p^3 + 3p^2q + 3pq^2 + q^3).$$

If p equals 1/2, we have

$$200\left(\frac{1}{8}\right) + 200\left(\frac{3}{8}\right) + 200\left(\frac{3}{8}\right) + 200\left(\frac{1}{8}\right).$$

These terms indicate, in order, the probable frequencies
of 3 successes, 2 successes, 1 success, and no successes.
The total frequencies secured by carrying through the
process of multiplication will be equal to the number of
trials, for all possible outcomes are covered by the expansion.

Thus, when we know in advance¹ the probabilities attaching
to similar but independent events, we may determine the
probable frequencies of any given number of successes or
failures. This is true whether p and q be equal or unequal.
It is necessary only that p and q remain constant. There
is here a fact of great significance in the development of
statistical theory.
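
The terms of N(p + q)^n can be generated mechanically from the binomial coefficients. The following sketch is an illustration, not part of the original text; the function name is hypothetical, and the last case (three dice, a six counting as a success, 216 trials) is an assumed example added to show that p and q need not be equal.

```python
from fractions import Fraction
from math import comb


def expected_frequencies(n_trials, n_events, p):
    """Terms of N(p + q)^n, ordered from n successes down to no successes."""
    q = 1 - p
    return [
        n_trials * comb(n_events, k) * p ** k * q ** (n_events - k)
        for k in range(n_events, -1, -1)
    ]


half = Fraction(1, 2)
two_events = expected_frequencies(200, 2, half)      # 200(p + q)^2
three_events = expected_frequencies(200, 3, half)    # 200(p + q)^3

# Hypothetical case with unequal p and q: sixes on three dice, 216 trials.
unequal = expected_frequencies(216, 3, Fraction(1, 6))

print([int(f) for f in two_events])
print([int(f) for f in three_events])
print([int(f) for f in unequal])
```

In each case the frequencies add to the number of trials, since the expansion covers every possible outcome.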

A COMPARISON OP ACTUAL AND THEORETICAL FREQUENCIES 
IN THE REALM OF PURE CHANCE 

Certain points of importance may be made clear by 
comparing some experimental results with the theoretical 
frequencies given by the binomial expansion. Twelve dice 

^ A distinction is generally drawn between a priori probalnlities of tlie type 
described above, and empwcaZ probabilities, knowledge of which is cierixTd 
from observation or,.experience.' ’.'As. an example' of the latter type wc have, 

71 173 

m the probability that a man aged. 35 will 'live 10 years, the ratio This 

51, w*. '. 

is based upon the American Experience Table of Mortality which shows that 
of 81,822 men living at the. age 'of 35,,.' there are 74,173 living ten years later. 
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were thrown a number of times. Each 4, 5, or 6 spot 
appearing was considered to be a success, while a 1, 2, 
or 3 spot was a failure. (In a typical throw we might have 
the following spots up: 3, 1, 5, 1, 2, 4, 4, 6, 3, 2, 3, 5. In 
this lot there are five successes, and the result is so tallied.) 
In a classical example recorded by W. F. R. Weldotf 
twelve dice were thrown in this way 4,096 times, a success 
being defined as above. The results are recorded in column 
(2) of Table 107, and the distribution is shown in Fig. 84. 
By computation we find the arithmetic mean and the 
standard deviation of this distribution to be, respectively, 
6.139 and 1.712. 

Let us compare with these results those which we might
expect from the given conditions. Twelve dice were thrown
each time, hence we are dealing with 12 independent
events. There were 4,096 trials. Since either a 4, 5, or 6 is
considered a success, p = q = 1/2.

For the terms in the binomial expansion we have

$$(p + q)^n = p^n + n p^{n-1} q + \frac{n(n-1)}{1 \cdot 2}\, p^{n-2} q^2 + \cdots$$

In the present case we have

$$4{,}096 \left( \tfrac{1}{2} + \tfrac{1}{2} \right)^{12}.$$

Expanding,

$$4{,}096 \left( \frac{1}{4{,}096} + \frac{12}{4{,}096} + \frac{66}{4{,}096} + \frac{220}{4{,}096} + \frac{495}{4{,}096} + \frac{792}{4{,}096} + \frac{924}{4{,}096} + \frac{792}{4{,}096} + \frac{495}{4{,}096} + \frac{220}{4{,}096} + \frac{66}{4{,}096} + \frac{12}{4{,}096} + \frac{1}{4{,}096} \right)$$

Completing the indicated multiplication we have the
theoretical frequencies of the various possible successes in
4,096 throws of twelve dice. These are shown in column
(3) of Table 107.

¹ Cited by F. Y. Edgeworth, Encyclopaedia Britannica, 11th ed., Vol. XXII, 394.

Table 107

Comparison of Actual and Theoretical Frequencies in Dice-Rolling Experiment

    (1)           (2)            (3)
 Number of      Observed     Theoretical
 successes    frequencies    frequencies

     0              0              1
     1              7             12
     2             60             66
     3            198            220
     4            430            495
     5            731            792
     6            948            924
     7            847            792
     8            536            495
     9            257            220
    10             71             66
    11             11             12
    12              0              1

                4,096          4,096
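
The theoretical column of Table 107 can be reproduced mechanically. The following is a minimal sketch in Python, a language the original text does not use; the variable names are illustrative assumptions. It evaluates each term of 4,096(p + q)^12 with p = q = 1/2.

```python
from math import comb

n, trials = 12, 4096      # twelve dice per throw, 4,096 throws
p = q = 0.5               # a 4, 5, or 6 counts as a success

# Term k of trials * (p + q)^n is the expected number of throws
# showing exactly k successes.
theoretical = [trials * comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

for k, freq in enumerate(theoretical):
    print(f"{k:2d} successes: {freq:5.0f}")
```

Because p = q, the probabilities p^k q^(n-k) all equal 1/4,096, so each theoretical frequency reduces to the binomial coefficient itself.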


The distribution of the theoretical frequencies is shown 
in Fig. 84, with that of the observed frequencies. The 
relationship of the two distributions is close. 

When we have, as in this case, a knowledge of the 
probabilities involved, it is possible to determine the arith- 
metic mean and the standard deviation of the distribution 
of the theoretical frequencies. As a general expression for 
the mean number of successes, where the number of inde- 
pendent events and the probability of success are known, 
we have 

$$M = np.$$

Applying the present values,

$$M = 12 \times \tfrac{1}{2} = 6.$$

The mean, as computed from the observed frequencies, is
6.139.

As a general expression for the standard deviation,¹ under
the same conditions, we have

$$\sigma = \sqrt{npq}.$$

In the present case

$$\sigma = \sqrt{12 \times \tfrac{1}{2} \times \tfrac{1}{2}} = \sqrt{3} = 1.732.$$

[Fig. 84. Observed and theoretical frequencies: Dice-Rolling Experiment]

The standard deviation, as computed from the actual
frequencies, is 1.712.
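
The comparison of observed and theoretical values may be checked directly from column (2) of Table 107. The sketch below is an illustration only, with assumed variable names; it computes the mean and standard deviation of the observed distribution and sets them beside np and the square root of npq.

```python
from math import sqrt

# Column (2) of Table 107: observed throws showing 0, 1, ..., 12 successes.
observed = [0, 7, 60, 198, 430, 731, 948, 847, 536, 257, 71, 11, 0]
n, p = 12, 0.5
q = 1 - p

total = sum(observed)
obs_mean = sum(k * f for k, f in enumerate(observed)) / total
obs_sd = sqrt(sum(f * (k - obs_mean) ** 2 for k, f in enumerate(observed)) / total)

theo_mean = n * p             # M = np
theo_sd = sqrt(n * p * q)     # sigma = sqrt(npq)

print(f"observed:    mean {obs_mean:.3f}, s.d. {obs_sd:.3f}")
print(f"theoretical: mean {theo_mean:.3f}, s.d. {theo_sd:.3f}")
```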

When proportions, or relative frequencies, are dealt with,
the standard deviation (σ') may be derived from the relation

$$\sigma' = \sqrt{\frac{pq}{n}}.$$

¹ This formula for the standard deviation of a binomial distribution is of
central importance. The derivation of this formula, and that for the mean of
a binomial distribution, are given in Appendix B.

The Normal Curve of Error

We may return to a consideration of the curve in Fig. 84
which represents the theoretical frequencies in the
dice-throwing experiments. It is a perfectly symmetrical 12-sided
polygon, the number of sides (excluding the base)
corresponding to the number of independent events in the
particular problem considered. With six events we should have a
six-sided figure, with twenty events a twenty-sided figure,
and so on. It is obvious that, as n increases, the number
of sides to the polygon increasing correspondingly,
the graph representing the expansion of the binomial
(p + q)^n approaches more and more closely a smooth curve.
With n infinitely large a perfectly smooth curve would be
secured. This is the normal curve of error which has been
plotted in Fig. 85.

The equation to this curve is written in several forms, of
which

$$y = y_0\, e^{-\frac{x^2}{2\sigma^2}}$$

is one. In this equation y₀, the maximum ordinate, is a
constant; e is a constant (the base of the Napierian
logarithms) having a value of 2.71828; σ represents the
standard deviation; and x is a given value of the independent
variable expressed as a deviation from the mean. The
maximum ordinate may be derived from the relation

$$y_0 = \frac{N}{\sigma\sqrt{2\pi}}$$

hence the equation to the normal curve may be written

$$y = \frac{N}{\sigma\sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma^2}}$$

where π is the constant 3.14159.
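
How closely the twelve-sided binomial polygon already follows this curve may be seen by setting the two side by side. The sketch below is an illustration under stated assumptions: it takes N = 4,096, M = np = 6 and σ = √(npq) from the dice example, and the function name is hypothetical.

```python
from math import comb, exp, pi, sqrt

n, N = 12, 4096
p = q = 0.5
mean, sigma = n * p, sqrt(n * p * q)

def normal_ordinate(x):
    """y = N / (sigma * sqrt(2*pi)) * exp(-x^2 / (2*sigma^2)), x measured from the mean."""
    return N / (sigma * sqrt(2 * pi)) * exp(-x**2 / (2 * sigma**2))

for k in range(n + 1):
    binomial = N * comb(n, k) * p**k * q**(n - k)
    print(f"{k:2d}  binomial {binomial:6.1f}   normal ordinate {normal_ordinate(k - mean):6.1f}")
```

With only twelve events the agreement is approximate; the text's claim concerns the limit as n grows large.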

This equation may be derived in several ways.¹ One
procedure, which throws light on the physical conditions
giving rise to the emergence of a normal distribution, starts
from three basic assumptions.

1. The causal forces affecting individual events are
numerous, and of approximately equal weight.

2. The causal forces affecting individual events are
independent of one another.

3. The operation of the causal forces is such that
deviations above the mean of the combined results are balanced
as to magnitude and number by deviations below the mean.

¹ Gauss's deduction of the error equation may be found in all standard works
on the theory of least squares. Cf. references at end of Appendix A.

A great part of the power which modern statistical
technique possesses is derived from the detailed knowledge
of the characteristics of the normal or Gaussian curve.
From prepared tables showing the fractional parts of the 
total area under the curve lying between ordinates erected 
at stated distances from the maximum ordinate, theoret- 
ical frequencies may be determined much more readily 
than by the laborious method based upon the binomial 
expansion. 


USE OF A TABLE OF AREAS UNDER THE NORMAL CURVE

The entire area under a frequency curve is taken to
represent the total number of frequencies. Given information
as to the proportion of the total area within a given segment,
it would be easy to compute the frequencies represented
by this segment, or to determine the probability that a
given observation from the population represented by the
curve would fall within the limits of this segment. Prepared
tables of the probability integral, of which Table 108 is an
example, serve just this purpose, with respect to the normal
curve. (A more detailed table than that here given is
needed for accurate computation. Appendix Table I will
serve most purposes.*)

* Areas under the normal curve, as calculated by Dr. W. F. Sheppard, are
given in Tables for Statisticians and Biometricians, edited by Karl Pearson,
Biometric Laboratory, University College, London; Tables of Applied
Mathematics, J. W. Glover, Ann Arbor, Michigan, George Wahr; Manual of
Problems and Tables in Statistics, F. C. Mills and D. H. Davenport, New York,
Henry Holt and Co.

Table 108

Area of the Normal Curve in Terms of Abscissa

(Giving fractional parts of the total area between y₀ and ordinates erected
at varying distances from y₀)

   x/σ        a           x/σ         a

   0.0     .00000         2.0      .47725
   0.1     .03983         2.1      .48214
   0.2     .07926         2.2      .48610
   0.3     .11791         2.3      .48928
   0.4     .15542         2.4      .49180
   0.5     .19146         2.5      .49379
                          2.5758   .49500
   0.6     .22575         2.6      .49534
   0.7     .25804         2.7      .49653
   0.8     .28814         2.8      .49744
   0.9     .31594         2.9      .49813
   1.0     .34134         3.0      .49865
   1.1     .36433         3.1      .49903
   1.2     .38493         3.2      .49931
   1.3     .40320         3.3      .49952
   1.4     .41924         3.4      .49966
   1.5     .43319         3.5      .49977
   1.6     .44520         3.6      .49984
   1.7     .45543         3.7      .49989
   1.8     .46407         3.8      .49993
   1.9     .47128         3.9      .49995
   1.96    .47500         4.0      .49997

Since the normal curve is symmetrical about the
maximum ordinate, the values given in Table 108 apply to
observations on both sides of the mean. In using such a
table, deviations from the mean are first expressed in units
of the standard deviation. (The term normal deviate is
applied to such a quantity, that is, to a deviation from the
mean of a normal distribution expressed in units of the
standard deviation of that distribution.) The proportion
of the total area lying between any two ordinates may then
be readily determined. For example: What proportion of
the cases in a normal distribution lies between the maximum
ordinate and an ordinate erected at a distance from the
mean equal to +σ? Reading down the x/σ column to 1.0,
we find the value .34134 opposite it. This, in ratio form,
is the proportion of cases falling within the limits indicated.
Expressing this ratio as a percentage, we have 34.134 per
cent as the answer to our question.

Fig. 85. An Illustration of the Measurement of Areas under the Normal Curve

Fig. 85 shows the relation of this area (the shaded area A)
to the total area under the curve.

What proportion of the total number of cases in a normal
frequency distribution will fall between an ordinate erected
at a distance from the mean equal to −1.4σ and one
erected at −2σ? From the table we find that 41.924 per
cent of the total area will lie between y₀ and the ordinate
at −1.4σ; 47.725 per cent will lie between y₀ and the
ordinate at −2σ. The difference, 5.801 per cent, will fall
between the ordinates at −1.4σ and at −2σ. This may
be converted into actual frequencies by taking this
proportion of the total number of cases in the given distribution.
The shaded segment B in Fig. 85 represents the area thus
marked off.
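
The two table look-ups just described can also be carried out by direct computation. The following sketch is offered as an illustration rather than as the method of the text: it uses the error function from Python's standard library in place of Table 108, and the function name is an assumption.

```python
from math import erf, sqrt

def area_from_mean(z):
    """Fraction of the total area lying between the maximum ordinate (the mean)
    and an ordinate erected z standard deviations away, on either side."""
    return 0.5 * erf(abs(z) / sqrt(2))

# Area A: between the mean and +1 sigma.
area_a = area_from_mean(1.0)

# Area B: between -1.4 sigma and -2 sigma, found by subtraction.
area_b = area_from_mean(-2.0) - area_from_mean(-1.4)

print(f"A = {area_a:.5f}, B = {area_b:.5f}")
```

Multiplying either proportion by the total number of cases converts it into a frequency, as the text indicates.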

For certain purposes we wish to know the proportion
of the total number of cases deviating by a stated amount
or more in either direction from the mean of a normal
distribution. If we wish to know the proportion of all
cases deviating from the mean by 1.96σ or more, we must
add to the area between +1.96σ and the upper limit of
the curve the area between −1.96σ and the lower limit of
the curve. Each of these areas equals .50000 − .47500,
or .025. The percentage of cases deviating from the mean
by +1.96σ or more is 2.5; the percentage deviating by
−1.96σ or more is 2.5. The percentage deviating above
or below the mean by 1.96σ or more is 5.0. Similarly,
it may be determined from the entries in Table 108 that
just one per cent of all the cases in a normal distribution
will deviate from the mean, positively or negatively, by
2.5758σ or more. This “one per cent” area is represented
by the sum of the shaded portions at the two tails of Fig. 85.
The ordinates defining the inside limits of these segments are
erected at +2.5758σ and at −2.5758σ, while the outer
limits are at infinity.
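
The two-tailed proportions quoted above may be verified in the same way. This is a minimal sketch with an assumed function name; it relies on the complementary error function, which gives the combined area of both tails beyond a stated normal deviate.

```python
from math import erfc, sqrt

def two_tail(z):
    """Proportion of cases deviating from the mean by z standard deviations
    or more, in either direction."""
    return erfc(z / sqrt(2))

for z in (1.96, 2.5758):
    print(f"deviation of {z} sigma or more: {two_tail(z):.4f}")
```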

Special significance attaches to the two limits last
mentioned, because of the uses made of them in interpreting
errors of sampling. This topic is developed at a later point.
Here we may note that the figures defining proportions of
the total area under the normal curve falling in given
areas may also be interpreted as probabilities. The
probability that a given observation, made at random in a
population distributed according to the normal law of error,
will fall between the mean and a value one standard
deviation above the mean is .34134; the probability that a given
observation will deviate from the mean by 1.96σ or more
is .05; the probability that a given observation will deviate
from the mean by 2.5758σ or more is .01.
The method by which probabilities of occurrence may
be determined from a table of areas under the normal
curve, and by which the significance of a given normal
deviate may be established, should be clearly understood.
These methods enter in many ways into the work of a
statistician.

The uses of the normal curve of error, and of the table 
of areas based upon the integration of this curve, are too 
varied to be enumerated at length here. A simple example 
may serve to introduce the subject. 

An Economic Application 

The statistical division of the American Telephone and 
Telegraph Company has made a study of the annual 
message use of four-party line residence message rate sub- 
scribers in Buffalo. The annual messages for each of 995 
subscribers were tabulated and classified.¹ The results,
together with certain computations, appear in Table 109. 


THE MOMENTS OF A FREQUENCY DISTRIBUTION

Some terms and symbols that have not been employed
heretofore may be introduced at this point. We may
write, using ν (nu) to define certain quantities of interest
to us,

$$\nu_1 = \frac{\sum f x'}{N}$$ = first moment of the distribution about the arbitrary origin.

$$\nu_2 = \frac{\sum f (x')^2}{N}$$ = second moment of the distribution about the arbitrary origin.

$$\nu_3 = \frac{\sum f (x')^3}{N}$$ = third moment of the distribution about the arbitrary origin.

$$\nu_4 = \frac{\sum f (x')^4}{N}$$ = fourth moment of the distribution about the arbitrary origin.

¹ “Introduction to Frequency Curves and Averages.” Statistical Bulletin,
Statistical Methods Series, No. 1. Issued by Chief Statistician, American
Telephone and Telegraph Co.

Table 109

Annual Message Use of 995 Telephone Subscribers

(Illustrating the computation of the moments of a frequency distribution)

(x' = deviation from arbitrary origin in class-interval units)

    (1)         (2)      (3)      (4)     (5)      (6)       (7)        (8)
Interval of     Mid-    Fre-
message use*    point   quency
                  m       f        x'     fx'     f(x')²    f(x')³     f(x')⁴

    0-   50       25      0      -10       0        0         0          0
   50-  100       75      1       -9      -9       81      -729      6,561
  100-  150      125      9       -8     -72      576    -4,608     36,864
  150-  200      175     19       -7    -133      931    -6,517     45,619
  200-  250      225     38       -6    -228    1,368    -8,208     49,248
  250-  300      275     50       -5    -250    1,250    -6,250     31,250
  300-  350      325     95       -4    -380    1,520    -6,080     24,320
  350-  400      375     85       -3    -255      765    -2,295      6,885
  400-  450      425    115       -2    -230      460      -920      1,840
  450-  500      475    132       -1    -132      132      -132        132
  500-  550      525    144        0       0        0         0          0
  550-  600      575    116        1     116      116       116        116
  600-  650      625     79        2     158      316       632      1,264
  650-  700      675     54        3     162      486     1,458      4,374
  700-  750      725     31        4     124      496     1,984      7,936
  750-  800      775     11        5      55      275     1,375      6,875
  800-  850      825      5        6      30      180     1,080      6,480
  850-  900      875      6        7      42      294     2,058     14,406
  900-  950      925      2        8      16      128     1,024      8,192
  950-1,000      975      1        9       9       81       729      6,561
1,000-1,050    1,025      1       10      10      100     1,000     10,000
1,050-1,100    1,075      1       11      11      121     1,331     14,641

                        995             -956    9,676   -22,952    283,564

* As here classified an item having a value of 50 was put in the class having
50 as an upper limit. Items falling on other class limits were similarly
disposed of.

“Moment” is a familiar mechanical term for the measure
of a force with respect to its tendency to produce rotation.
The strength of this tendency depends, obviously, upon the
amount of the force and the distance of the point at which
the force is exerted from the origin. The term is used in
statistics in a quite analogous sense, the class-frequencies being
looked upon as the forces in question. The size of each
class-frequency and the distance of each class midpoint from the
origin are the factors of prime importance in this respect.
The moments of a distribution about any origin may be
computed by multiplying the frequency of each class by
a given power of its distance, along the x-axis, from the
origin, summing the resulting products and dividing by
the number of cases. If the first moment is desired, the
first power of the x-distance is employed; if the fourth
moment, the fourth power of the x-distance, etc. The
subscripts indicate the moments represented by the various
symbols.
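
The rule just stated translates directly into a short computation. The sketch below is illustrative only, with assumed variable names; it takes the frequencies of Table 109 and the deviations x' (in class-interval units, measured from the class whose midpoint is 525) and forms the first four moments about that arbitrary origin.

```python
# Column (3) of Table 109: class frequencies, lowest class first.
freqs = [0, 1, 9, 19, 38, 50, 95, 85, 115, 132, 144,
         116, 79, 54, 31, 11, 5, 6, 2, 1, 1, 1]
# Column (4): deviations x' = -10, -9, ..., 11 in class-interval units.
devs = range(-10, 12)
N = sum(freqs)

def moment(k):
    """k-th moment about the arbitrary origin: sum of f * (x')^k, divided by N."""
    return sum(f * x**k for f, x in zip(freqs, devs)) / N

nu = {k: moment(k) for k in (1, 2, 3, 4)}
for k, value in nu.items():
    print(f"nu_{k} = {value:.6f}")
```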

The most significant moments, for statistical purposes,
are those which relate to the arithmetic mean as origin.
Representing these moments by π (pi)¹ we have the
following relationships:

First moment about the mean:   $\pi_1 = 0$

Second moment about the mean:  $\pi_2 = \nu_2 - \nu_1^2$

Third moment about the mean:   $\pi_3 = \nu_3 - 3\nu_1\nu_2 + 2\nu_1^3$

Fourth moment about the mean:  $\pi_4 = \nu_4 - 4\nu_1\nu_3 + 6\nu_1^2\nu_2 - 3\nu_1^4$

The computation of these moments from the data, as
classified, involves the assumption that the items in each
class can be treated as though they were concentrated at
the midpoint of that class. It has been established that,
under certain conditions, calculations made on this
assumption are subject to a constant error. In particular, it has
been shown that the values of the second and fourth
moments are not the same, when computed from grouped
data, as when computed from ungrouped data.

W. F. Sheppard² has worked out certain corrections for
this bias. His corrections may be applied when two
conditions prevail:

(1) When the distribution relates to a continuous variable.

(2) When the frequency curve is characterized by “high
contact,” i.e., when the frequency curve tapers off gradually
in both directions.

The symbol μ (mu) is employed to represent a corrected
moment about the mean. The application of Sheppard’s
corrections gives us the following final formulation:

$$\mu_1 = 0$$

$$\mu_2 = \pi_2 - \frac{1}{12}$$

$$\mu_3 = \pi_3$$

$$\mu_4 = \pi_4 - \frac{1}{2}\pi_2 + \frac{7}{240}$$

(In applying the corrections 1/12 and 7/240, the
corresponding decimal values, .083333 and .029167, will generally
be employed.) It is assumed in making these corrections
that a class-interval unit has been employed in measuring
deviations from the mean.

¹ In the equation to the normal curve π represents the familiar constant,
3.14159. As a symbol for a moment about the mean it relates, of course, to
no such constant value.

² Cf. Proceedings of the London Mathematical Society, Vol. XXIX, 353-380.

It may be noted in passing that the standard deviation is
the square root of the second moment about the mean. For
the uncorrected value,

$$\sigma = \sqrt{\pi_2}.$$

If Sheppard’s corrections¹ are to be applied

$$\sigma = \sqrt{\mu_2}.$$

The calculation of the moments of the frequency
distribution of telephone subscribers is shown below.
Sheppard’s corrections are applied, since the curve is marked by
reasonably high contact. It is a discontinuous distribution,
but the unit (1) is so small in comparison with the range
that it may be treated as continuous.

$$\nu_1 = \frac{-956}{995} = -.960804$$

$$\nu_2 = \frac{9{,}676}{995} = 9.724623$$

$$\nu_3 = \frac{-22{,}952}{995} = -23.067337$$

$$\nu_4 = \frac{283{,}564}{995} = 284.988945$$

$$\pi_1 = 0$$

$$\pi_2 = \nu_2 - \nu_1^2 = 9.724623 - .923144 = 8.801479$$

$$\pi_3 = \nu_3 - 3\nu_1\nu_2 + 2\nu_1^3 = -23.067337 + 28.030370 - 1.773922 = 3.189111$$

$$\pi_4 = \nu_4 - 4\nu_1\nu_3 + 6\nu_1^2\nu_2 - 3\nu_1^4 = 284.988945 - 88.652760 + 53.863384 - 2.556586 = 247.642983$$

$$\mu_1 = 0$$

$$\mu_2 = \pi_2 - \frac{1}{12} = 8.801479 - .083333 = 8.718146$$

$$\mu_3 = \pi_3 = 3.189111$$

$$\mu_4 = \pi_4 - \frac{1}{2}\pi_2 + \frac{7}{240} = 247.642983 - 4.400739 + .029167 = 243.271411$$

¹ It should be noted that these corrections, when appropriate, are applicable
to the standard deviations entering into the calculation of the coefficient of
correlation.
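
The chain of calculations from the ν's to the π's to the μ's may be retraced as follows. This is a sketch under stated assumptions: the totals are those of Table 109, the variable names are illustrative, and small differences in the last decimal place arise because the printed figures use rounded intermediate values.

```python
# Moments about the arbitrary origin, from the column totals of Table 109.
N = 995
nu1, nu2, nu3, nu4 = -956 / N, 9676 / N, -22952 / N, 283564 / N

# Moments about the mean (uncorrected).
pi2 = nu2 - nu1**2
pi3 = nu3 - 3 * nu1 * nu2 + 2 * nu1**3
pi4 = nu4 - 4 * nu1 * nu3 + 6 * nu1**2 * nu2 - 3 * nu1**4

# Sheppard's corrections (class-interval units).
mu2 = pi2 - 1 / 12
mu3 = pi3
mu4 = pi4 - pi2 / 2 + 7 / 240

print(f"pi2 = {pi2:.6f}, pi3 = {pi3:.6f}, pi4 = {pi4:.6f}")
print(f"mu2 = {mu2:.6f}, mu3 = {mu3:.6f}, mu4 = {mu4:.6f}")
```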


CRITERIA OF CURVE TYPE

Having these values, we may return to a consideration
of the main problem, the utilization of our knowledge of
the normal curve. There are certain criteria, represented
by the letters β (beta) and κ (kappa), which enable us
to determine readily whether a given distribution may be
described by a curve of the normal type. These may be
derived from the corrected moments of the given distribution.

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{10.170429}{662.632015} = .01534853$$

$$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{243.271411}{76.006070} = 3.200683$$

$$\kappa_2 = \frac{\beta_1(\beta_2 + 3)^2}{4(4\beta_2 - 3\beta_1)(2\beta_2 - 3\beta_1 - 6)} = \frac{.01534853 \times 38.448470}{4(12.756686)(.355320)} = \frac{.5901275}{18.130823} = .032548$$

For the normal curve these criteria have the following
values:

$$\beta_1 = 0$$

$$\beta_2 = 3$$

$$\kappa_2 = 0$$

We may conclude, tentatively, that the normal curve
may be used to describe the given distribution.*
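
The three criteria follow from the corrected moments by simple arithmetic. The sketch below is an illustration with assumed variable names; it starts from the values of μ₂, μ₃, and μ₄ printed above.

```python
# Corrected moments about the mean, as given in the text.
mu2, mu3, mu4 = 8.718146, 3.189111, 243.271411

beta1 = mu3**2 / mu2**3
beta2 = mu4 / mu2**2
kappa2 = (beta1 * (beta2 + 3) ** 2
          / (4 * (4 * beta2 - 3 * beta1) * (2 * beta2 - 3 * beta1 - 6)))

print(f"beta1 = {beta1:.8f}  (normal curve: 0)")
print(f"beta2 = {beta2:.6f}  (normal curve: 3)")
print(f"kappa2 = {kappa2:.6f}  (normal curve: 0)")
```

Note that the last factor of the denominator, 2β₂ − 3β₁ − 6, is small here, so κ₂ is sensitive to rounding in the moments.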

Fitting a Normal Curve; Use of a Table of Areas

The process of fitting a normal curve to a set of observa- 
tions involves the computation of theoretical frequencies 
corresponding to the observed frequencies. This may be 
done from a table of areas under the normal curve (see 
Appendix Table I). Using such a table, in the manner indi- 
cated in the preceding section, the areas between the maxi- 
mum ordinate and ordinates erected at the various class 
limits may be determined. By the simple process of subtrac- 
tion the area within each class, and hence the theoretical 
frequencies, may then be computed. The procedure is illus- 
trated in Table 110 on page 446, relating to the distribution 
of telephone subscribers. 
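
The fitting procedure may be sketched as follows. The sketch makes several assumptions that should be noted: it takes the mean as 476.96 and the standard deviation as 147.65 in original units (the values the text reports for this distribution), it computes areas with the error function rather than from Appendix Table I, and the names are illustrative. Because the printed table works from deviations rounded to two decimals, the class frequencies found here may differ from those of Table 110 by a case or so.

```python
from math import erf, sqrt

mean, sigma, N = 476.96, 147.65, 995
limits = list(range(0, 1101, 50))      # class limits 0, 50, ..., 1,100

def cdf(x):
    """Proportion of the normal curve's area lying below the value x."""
    return 0.5 * (1 + erf((x - mean) / (sigma * sqrt(2))))

# Area within each class, found by subtraction, times the number of cases.
theoretical = [N * (cdf(hi) - cdf(lo)) for lo, hi in zip(limits, limits[1:])]
below = N * cdf(limits[0])             # cases expected below the lowest limit
above = N * (1 - cdf(limits[-1]))      # cases expected above the highest limit

for lo, hi, freq in zip(limits, limits[1:], theoretical):
    print(f"{lo:5d}-{hi:5d}  {freq:7.2f}")
```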

The theoretical distributions derived from this fitting 
process may be compared with the observed frequencies, 
as given in Table 109. Or the comparison of the actual 
distribution and the fitted curve may be made graphically, 
as in Fig. 86. It is apparent by inspection that the normal 
curve gives a fairly good fit to the data, although there 
are several classes in which the differences are marked. A 
natural question arises as to the reason for the failure of 
the normal curve to fit at all points. There are two possible 

answers to such a question. The failure to fit may be due
merely to chance fluctuations such as are found in any
sample. We may have an underlying law of distribution
of residence subscribers, classified by message use, which
accords perfectly with the normal law of error, but the
particular sample selected may be marked by certain
irregularities which would be ironed out if a very large
number of cases were included. On the other hand, the
differences may be due to the fundamental failure of such a
distribution to accord with the normal law of error. Such
a law may not describe the distribution of telephone calls,
in which case the normal curve should not be employed.

* Account is later taken of the bearing of errors of sampling on this
conclusion. See Chap. XVIII.

Table 110

Illustrating the Computation of Theoretical Frequencies from a Table
of Areas

  (1)       (2)          (3)              (4)                  (5)
 Class   Deviation   Proportion of   Number of cases   Theoretical frequencies,
 limit   from mean   area between    between y₀ and          by classes
           x/σ       y₀ and ordi-    ordinate at x/σ
                     nate at x/σ

     0    -3.23       .4993810          496.88
    50    -2.89       .4980738          495.58             0-   50      1.92*
   100    -2.55       .4946139          492.14            50-  100      3.44
   150    -2.22       .4867906          484.36           100-  150      7.78
   200    -1.88       .4699460          467.60           150-  200     16.76
   250    -1.54       .4382198          436.03           200-  250     31.57
   300    -1.20       .3849303          383.01           250-  300     53.02
   350    - .86       .3051055          303.58           300-  350     79.43
   400    - .52       .1984682          197.48           350-  400    106.10
   450    - .18       .0714237           71.07           400-  450    126.41
   500    + .16       .0635595           63.24           450-  500    134.31
   550    + .495      .1896931          188.74           500-  550    125.50
   600    + .83       .2967306          295.25           550-  600    106.51
   650    +1.17       .3789995          377.10           600-  650     81.85
   700    +1.51       .4344783          432.31           650-  700     55.21
   750    +1.85       .4678432          465.50           700-  750     33.19
   800    +2.19       .4857379          483.31           750-  800     17.81
   850    +2.53       .4942969          491.83           800-  850      8.52
   900    +2.87       .4979476          495.46           850-  900      3.63
   950    +3.20       .4993129          496.82           900-  950      1.36
 1,000    +3.54       .4997999          497.30           950-1,000       .48
 1,050    +3.88       .4999478          497.45         1,000-1,050       .15
 1,100    +4.22       .4999878          497.49   greater than 1,050      .05

                                                                      995.00

* The theoretical distribution shows .62 of a case below −3.23σ. To
preserve formal consistency this amount has here been added to the theoretical
frequency between 0 and 50.

At this stage we may note, without discussion, that the
differences between theoretical and observed frequencies in
the present example are small enough to be attributed to
chance fluctuations of sampling. The reasoning that
supports this conclusion is presented in a later section (Chapter
XVIII). The evidence is clear, however, that the
discrepancies between the observed frequencies and those in the
corresponding normal distribution are not excessively large.
The observed facts are not inconsistent with the hypothesis
that residential telephone subscribers, classified according
to frequency of telephone use, are distributed in accordance
with the normal law of error.

Fig. 86. Illustrating the Fitting of a Normal Curve to Frequency
Distribution of Telephone Subscribers, Classified according to Message
Use

This conclusion gives generality to the results of our 
study. We have a great deal of information concerning 
the attributes of distributions following the normal law 
of error, and once the identification of an actual distribu- 
tion with this standard type has been effected we may 
draw upon this store of knowledge. In using the original 
frequency table we are limited to the classes there estab- 
lished. We may now go beyond this and determine how 
many cases may be expected within stated limits. We may 
compute the probability of a case falling between any two 
points on the x-scale, or above or below any given value. 
The observed results, standing alone, are restricted in their 
significance to the particular observations recorded, but 
the theoretical frequencies have no such limitations. They 
apply generally, to the entire population from which the 
sample was drawn. In so far as we are assured of the repre- 
sentative character of our sample we have a basis for 
inference that would be afforded by no amount of study 
of the particular distribution as a thing apart. This fact, 
that a knowledge of the theoretical frequencies permits 
generalization beyond the limits of direct observation, is 
perhaps the most important of the advantages derived 
from the identification of an actual distribution with an 
ideal type, such as the normal distribution.*

* As was stated, the normal curve is but one type of frequency curve, though
one of basic importance. A comprehensive system of frequency curves is that
associated with the name of Karl Pearson, who has derived equations to and
has described in detail a number of standard types. An account of other
fundamental types will be found in the books by Arne Fisher referred to at
the end of this chapter.

NOTE ON THE DESCRIPTION OF THE FREQUENCY DISTRIBUTION

With the aid of the criteria explained in this chapter it is possible
to describe a frequency distribution more accurately than is possible
with the measurements employed in the earlier chapters. A
treatment of this subject is beyond the scope of the present book, but it
seems advisable to indicate briefly the nature of these additional
measures.

The value of β₂ serves as a measure of the degree of
“flat-toppedness” found in a given curve. If β₂ = 3, as in the normal type, the
curve is said to be mesokurtic. If β₂ < 3 the curve is platykurtic, or
flatter than the normal type. If β₂ > 3, as in the example given
above, the curve is leptokurtic, or more peaked than the normal.

A measure of skewness which is more accurate than those given
early in the book may also be computed from these criteria.
Karl Pearson has shown that the quantity

$$\chi = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}$$

serves as a measure of the degree of asymmetry of a given curve.
Inserting the values of β₁ and β₂ given above we have, in the case
of the distribution based on message use,

$$\chi = -.05558.$$

(χ is positive if the mean is greater than the median, negative if
the mean is less than the median. In the present case the value of
the mean is 476.96, that of the median is 482.39, hence the
skewness is negative.)

Finally, the distance, d, between the mean and the mode may
be determined from the relation

$$d = \chi \times \sigma.$$

In the distribution described above (relating to telephone use) σ,
in original units, equals 147.65. Hence

$$d = -.05558 \times 147.65 = -8.21.$$

Since

$$Mo = M - d$$

we have

$$Mo = 476.96 + 8.21 = 485.17.$$

This gives a truer approximation to the modal value than any of
the methods discussed in Chapter IV.
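
The computation of χ, d, and the mode may be retraced as below. This is a sketch under stated assumptions: β₁, β₂, the mean, the median, and σ are taken as printed in the text, the sign of χ is assigned by the mean-versus-median rule quoted above, and the variable names are illustrative.

```python
from math import sqrt

beta1, beta2 = 0.01534853, 3.200683           # criteria from the corrected moments
mean, median, sigma = 476.96, 482.39, 147.65  # original units, as given in the text

# Magnitude of Pearson's skewness measure.
chi_size = sqrt(beta1) * (beta2 + 3) / (2 * (5 * beta2 - 6 * beta1 - 9))
# Sign rule: positive if the mean exceeds the median, negative otherwise.
chi = chi_size if mean > median else -chi_size

d = chi * sigma       # distance between mean and mode
mode = mean - d       # Mo = M - d

print(f"chi = {chi:.5f}, d = {d:.2f}, mode = {mode:.2f}")
```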

The methods exemplified in Table 109 and the accompanying
text provide, therefore, a straightforward procedure for the
measurement of the essential attributes of a frequency distribution.
The mean and mode as measurements of central tendency, the
standard deviation as a measure of dispersion, χ as a measure of
skewness, and β₂ − 3 as a measure of the degree of concentration
of observations near the point of maximum frequency may be
computed directly from the first four moments of a distribution.
These methods are available, of course, whether or not a study is
to be carried to the point of determining and fitting a frequency
curve of an appropriate ideal type. They are to be recommended
for use in any systematic study of frequency distributions.

CHAPTER XIV


STATISTICAL INDUCTION AND THE PROBLEM 
OF SAMPLING 

The preceding pages have been devoted to an account of 
tools employed in statistical analysis. Examples illustrating 
the application of these tools to specific problems have 
been presented, but the emphasis throughout has been on 
technique. It is appropriate at this point that we stand off 
a distance, enlarging our perspective, and consider certain 
general problems relating to the application of these tools. 
What is their proper place in economic and business research? 
What are the assumptions involved in using them and 
what are their limitations? What are the end products 
of statistical analysis? How valid are the conclusions 
reached? What restrictions attach to such conclusions? 
We must give thought to such questions as these, if statistical 
methods are to be intelligently applied. 

Statistical Description and Statistical Induction 

In approaching this subject we must first make clear
the distinction between statistical description and statistical
induction. By employing the methods of statistics it is
possible, as we have seen, to describe succinctly a mass
of quantitative data. Hundreds or thousands of individual
cases may be classified, and a frequency distribution formed.
The essence of this distribution may be boiled down to
perhaps four measures: of central tendency, variation,
skewness, and kurtosis. A tremendous gain has been realized
in thus replacing the multiplicity of individual cases by a
limited number of measures that define the characteristics
of the group as a whole. The possession of such tools makes
it possible for our limited powers of perception to grasp
the significance of facts in the mass. Again, the methods
of statistics enable us to describe relations between variable
quantities. By securing the equation to an appropriate
curve fitted to the data by mathematical methods, we
may determine how much, on the average, one quantity
changes in value as a related factor varies. This may be
supplemented by a measure of the scatter or dispersion
about the fitted curve, and by a measure, in abstract
terms, of the degree of correlation between the dependent
and the independent variables.

In so far as the results are confined to the cases actually
studied, these various statistical measurements are merely
devices for describing certain features of a distribution, or
certain relationships. Within these limits the measures
may be used with perfect confidence, as accurate
descriptions of the given characteristics. But when we seek to
extend these results, to generalize the conclusions, to apply
them to cases not included in the original study, a quite
new set of problems is faced.

The logical process by which one arrives at generalizations
from a study of particular cases is termed induction, as
opposed to deduction, which involves the drawing of
specialized conclusions from general propositions. By statistical
induction or statistical inference is meant the generalization
of statistical results, the application to a population of
measurements derived from a sample. We are employing
this procedure constantly in practical statistical work,
though not always with a full realization of the assumptions
inherent in that process and of the limitations attaching
to it.

The Natuke of Statistical Inbuction 

The problem at issue in considering the validity of 
statistical induction may be put in the following form:
A statistical measurement — an average, a frequency ratio, 
a coefficient of correlation — has been derived from the
study of sample data drawn from a given population. 
(The term “population” refers to a complete universe of 
things or phenomena having stated characteristics in com- 
mon.) May we assume that, if additional samples were 
taken from the same population, the corresponding measure- 
ments would have the same values? If not, may we deter- 
mine the approximate limits to the fluctuations to be 
expected in these measures, as derived from successive 
samples? Here, obviously, is a problem of supreme impor- 
tance. Karl Pearson has called it “the fundamental problem 
of practical statistics.” If we cannot be assured of a certain 
degree of stability in the results secured from successive 
samples it would be quite invalid to generalize from the 
examination of a limited number of cases. No weight 
would attach to any study except one covering the entire 
universe of things or phenomena composing the given 
population. Yet such all-inclusive studies of economic 
phenomena are practically impossible. Index numbers of 
prices, of wages, of living costs, equations describing the 
relation between the production and prices of given com- 
modities, coefficients of correlation between temperature 
and crop yield — all must of necessity be based on the 
study of samples. The problem of statistical inference, 
in the words of Oskar Anderson, is that of so utilizing the 
samples as to arrive at the best possible approximation 
to the characteristics of the universe. 

We have noted that statistical inference is a special 
form of a general process of reasoning, induction. Two 
points are to be emphasized concerning inductive reasoning. 
First, the conclusion of any induction holds only in terms 
of probabilities. For such a conclusion, by the very defi- 
nition of an induction, applies to cases not included in 
the observations. As opposed to deductive reasoning, in 
which the conclusion is implicit in the premises, induction 
yields a conclusion going beyond the premises. When all 
the cases to be covered by the conclusions are included in
the observations, the conclusion ceases to be an induction 
and becomes a descriptive statement. Accordingly, although 
induction is a highly fruitful means of adding to human 
knowledge, it is always hazardous. A leap in the dark is 
always involved, when we apply conclusions to cases not 
yet observed. 

The justification for this leap in the dark, and this is 
the second point we wish to stress, is found in an assumption 
that there is a “limitation to the amount of independent 
variety” found in nature. While there is variation in 
nature, the degree of such variation is limited; there is 
some uniformity in all natural processes. When we are 
dealing with quantitative data this uniformity in nature is
found in the stability of large numbers, as exemplified by 
the curious regularities in such phenomena as birth rates 
or death rates. Nature, in other words, is not marked by 
utter chaos; principles of regularity, order and stability 
appear in all natural processes, and these principles are 
strongly evident when we deal with masses of quantitative 
data. Therefore, when we generalize such a measure as 
an index number of wholesale prices, we do so on some such 
assumption as this: It is reasonable to suppose that, in 
the larger population to which this result is to be applied, 
there exists a uniformity with respect to the characteristic 
or relation we have measured. As a result of this uniformity 
we should expect statistical measurements derived from 
successive samples drawn from this population to fluctuate 
within definite limits. 

It is evident that in making this assumption, in saying 
“It is reasonable to suppose . . . ,” we are introducing 
an hypothesis which is incapable of complete verification 
by purely statistical methods. There is, thus, in every 
statistical induction, an a priori element. The statistical 
conclusion can never stand completely on its own feet. 
It must be endorsed by reason and judgment if it is to 
carry conviction. If a high positive coefficient of correlation
were secured from the study of a sample relating to banana 
importations and sales of new life insurance, this would 
not furnish convincing evidence of a causal relation, or a 
relation of contingency, between these two variables. There 
is no reasonable basis for assuming that, in the larger 
universe of phenomena from which the sample was drawn, 
there would be uniformity with respect to this relation- 
ship. 

Statistical inference differs from the general process of 
induction in that a quantitative result is generalized. We 
seek to apply to a larger group — the population — the 
value of mean, standard deviation, or coefficient of correla- 
tion that has been computed from a sample. The measure- 
ment secured from the sample is an estimate of the corre- 
sponding measurement relating to the population. The 
direct task faced in such generalization is that of determining 
the limits within which these estimates would probably 
fluctuate, if based upon a number of different samples 
drawn from the same population. A number defining these 
limits will serve as a measure of the reliability of the given 
results, when generalized to apply to the population. 

We should make clear at this point the sense in which 
the term “population” is used. When we speak of a popula- 
tion we are referring to an aggregate, whether of persons, 
things, or measurements, having certain common charac- 
teristics, or generated by a given system of causes. The 
term may refer to a hypothetical population from which 
a given sample may or may not have been drawn, or to a 
parent population of which a given sample is assumed to 
be representative. It may be a population of prices, or a 
population of cephalic indices; the term is not restricted 
to a population of persons. R. A. Fisher speaks of a “popula- 
tion of possibilities,” referring to the possible results of 
an experiment many times repeated. Of high importance 
in statistics are populations of statistical measurements — 
means, coefficients of variation, standard deviations, etc.
It is proper to note that the populations to which most 
statistical results apply are infinite in size. Statistical 
generalizations relate to hypothetical universes containing 
infinite numbers of units. We assume a sample to be 
drawn, not from the finite population that might be covered 
by actual enumeration, but from the infinite population, 
or universe, that would be generated if the forces or system
of causes that brought this sample into being were to 
operate without limit. (Statisticians have given some atten- 
tion to special techniques, appropriate for dealing with a 
finite universe, but problems with which we do not here 
deal are faced in such applications.) 

The principle of the uniformity of nature is assumed, of
course, to apply to the universes from which our samples
are drawn, if these samples are to be made bases of inductive
generalizations. We must assume that these universes are 
stable, and that all their attributes are stable. An attribute 
of such a stable universe may not be exactly determined 
from the attribute of a single sample, but measurements 
defining the attributes of numerous samples drawn from 
the same universe will be distributed about the true value 
(i.e., that of the universe) in a systematic fashion. Each 
sample value is, of course, an estimate of the true value 
of the corresponding attribute of the population at large.¹
The precise determination of the characteristics of this 
distribution of estimates is essential to the determination 
of their reliability. 

Having knowledge of this distribution we may deter-
mine the limits within which estimates derived from dif-
ferent samples of the same population may be expected
to fluctuate. A measure of these limits will serve as a 

¹ By convention, not yet generally adopted, but useful, the attribute of
the population which is being estimated is termed a parameter, while the esti-
mate of it is termed a statistic. Our certain knowledge is limited to statistics.
We use this knowledge to the best of our ability to provide us with approx-
imations to the parameters which we can never know.


458 INDUCTION AND SAMPLING 


measure of the reliability of the given results, when gen-
eralized. Such a measure might be secured by the labori-
ous process of studying a great many different samples,
just as the dice were thrown 4,096 times in a preceding 
example. Thus we might desire to test the reliability of an 
average of weekly earnings of a certain class of workers. 
A first average might be secured from a sample composed 
of 250 individual records. This result might be tested by 
computing 499 additional averages, each based on 250 
individual records. These 500 averages would not be 
identical in value, but if they were tabulated a frequency 
distribution closely approximating the normal type would 
be secured. From this distribution we might compute the 
mean of all the averages and the standard deviation of 
these averages. This standard deviation would serve as a 
measure of the variation found in the averages of weekly 
earnings, as computed from successive samples. 
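
The laborious experiment just described can be imitated on a small scale. The following is a minimal sketch in Python, not a procedure taken from the text: the population of weekly earnings is hypothetical, and the skewed gamma distribution and its parameters (mean 30, standard deviation 15) are assumptions chosen only for illustration. The sketch draws 500 samples of 250 records each and computes the mean and the standard deviation of the 500 averages.

```python
import random
import statistics

def sample_means(n_samples=500, sample_size=250, seed=1):
    """Draw repeated random samples and return the average of each.

    The population is hypothetical: weekly earnings are assumed to
    follow a skewed gamma distribution with mean 4.0 * 7.5 = 30.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        records = [rng.gammavariate(4.0, 7.5) for _ in range(sample_size)]
        means.append(statistics.fmean(records))
    return means

means = sample_means()
mean_of_means = statistics.fmean(means)   # centre of the 500 averages
sd_of_means = statistics.pstdev(means)    # variation among the 500 averages

print(f"mean of the 500 averages:           {mean_of_means:.3f}")
print(f"standard deviation of the averages: {sd_of_means:.3f}")
```

The second figure printed is the quantity the paragraph describes: a measure of the variation found in averages computed from successive samples.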

We have noted at an earlier point that a Gaussian or 
normal distribution is generated when three general condi- 
tions prevail. These are: 

a multiplicity of forces affecting each observation 

independence of the various forces affecting each observa- 
tion 

equality of the forces tending to generate values above 
and below the mean value. 

The process of random sampling which would, presumably, 
be employed in securing the successive samples referred to 
in the preceding paragraph should satisfy the conditions 
giving rise to the normal distribution. There should be no 
special or unbalanced influences affecting particular samples. 
The differences between successive samples should be such 
as arise from a combination of forces, intermingled and 
not open to separate definition; that is, from “the mass of 
floating causes known as chance.” If these conditions be 
met, and if the field of observation (i.e, the universe being 
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sampled) be homogeneous, the distribution of means com- 
puted from the successive samples would be normal. 

This is a fact of high importance to statistical inference. 
In the realm of original observations, relating to persons, 
things, or events, normal distributions are the exception, 
rather than the rule. But the measurements which the 
statistician derives from successive samples, and which he 
employs in the inductive reasoning by which he generalizes 
his results, are far more frequently distributed in accordance 
with the Gaussian law. Much of the power of statistical 
instruments derives from this fact. 
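
The contrast between original observations and measurements derived from samples can be examined numerically. The sketch below is illustrative only: the exponential population is an assumed example of a strongly skewed distribution, and the sample size of 100 and the count of 2,000 samples are arbitrary choices. It compares the skewness of the original observations with that of the sample means, and counts the proportion of means falling within one standard deviation of their centre (about 68 per cent under the normal law).

```python
import random
import statistics

def skewness(values):
    """Moment coefficient of skewness (zero for a symmetrical distribution)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)

rng = random.Random(7)

# Hypothetical, strongly skewed population of original observations.
originals = [rng.expovariate(1.0) for _ in range(20000)]

# Means of 2,000 random samples of 100 observations each.
means = [statistics.fmean([rng.expovariate(1.0) for _ in range(100)])
         for _ in range(2000)]

centre = statistics.fmean(means)
spread = statistics.pstdev(means)
within_one_sd = sum(abs(m - centre) <= spread for m in means) / len(means)

print(f"skewness of original observations: {skewness(originals):.2f}")
print(f"skewness of sample means:          {skewness(means):.2f}")
print(f"proportion of means within one standard deviation: {within_one_sd:.2f}")
```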

The statistical investigator is rarely in a position to 
build up a frequency distribution of constants derived from 
numerous samples. It is generally impossible to take 400 
or 500 successive samples, in testing the reliability of a 
given measurement. As a practicable alternative a process
of mathematical deduction is employed, in determining the 
characteristics of distributions of statistical measurements 
derived from random samples, drawn under stated conditions 
from given populations. An example of such mathematical 
deduction is provided by the derivation of the mean and 
standard deviation of a distribution generated under the 
following conditions:

p, the probability of a given event occurring, is known

q, the probability of the event not occurring, is known

n, the number of independent events in a single trial, is known.

Under these conditions, as was noted in the preceding
chapter, $M = np$, and $\sigma = \sqrt{npq}$.¹ By a somewhat similar
chain of reasoning, we may determine the characteristics
of a distribution composed of arithmetic means of a number
of samples of constant size drawn from a given population.
The standard deviation of such a distribution is given by

$$\sigma_M = \frac{\sigma}{\sqrt{N}}$$

where $\sigma_M$ is the required measure, $\sigma$ is the standard deviation
of the population from which the samples have been drawn,
and $N$ is the number of observations in each of the samples.²

¹ For proofs, see Appendix B.
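
The two results of mathematical deduction cited here can be checked numerically. The following is a minimal sketch under stated assumptions: the values n = 12 and p = 1/3 are illustrative and are not taken from the text, and the function names are hypothetical. The first function computes the mean and standard deviation directly from the terms of the binomial expansion, for comparison with np and the square root of npq; the second evaluates the standard deviation of a distribution of sample means.

```python
import math

def binomial_mean_sd(n, p):
    """Mean and standard deviation computed term by term from the binomial."""
    q = 1.0 - p
    probs = [math.comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
    mean = sum(k * pr for k, pr in enumerate(probs))
    variance = sum((k - mean) ** 2 * pr for k, pr in enumerate(probs))
    return mean, math.sqrt(variance)

def standard_error_of_mean(sigma, n):
    """Standard deviation of a distribution of sample means: sigma / sqrt(N)."""
    return sigma / math.sqrt(n)

n, p = 12, 1 / 3   # illustrative values only
mean, sd = binomial_mean_sd(n, p)

print(mean, n * p)                        # direct computation vs. np
print(sd, math.sqrt(n * p * (1 - p)))     # direct computation vs. sqrt(npq)
print(standard_error_of_mean(15.0, 250))  # sigma = 15, N = 250 (assumed figures)
```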

The mathematical deduction of the characteristics
of distributions of statistical constants derived from samples
is fundamental to the whole process of statistical inference.
It is not, of course, a task that needs to be done afresh in
each statistical investigation. When the law of distribution
of a given class of statistical measurements has been deter-
mined, statisticians may utilize the results in their various
research fields, with due regard to all the conditions under
which the given law holds. This basic task has been per-
formed for most of the statistical measurements currently
employed. Earlier approximations have been refined in
recent years for many classes of statistical measurements.
The statistician today may draw upon a considerable body
of tested and verified materials in determining the relia-
bility of various kinds of statistical estimates. These
materials exist in the form of shorthand expressions for
the standard errors of different statistical constants, and
in prepared tables for use when the distributions deviate
materially from the type defined by the normal law of error.

Practical Problems of Sampling

The preceding discussion has dealt with one aspect of
statistical induction. The argument has proceeded on the
assumption that inferences concerning the attributes of
a population would be based upon a sample thoroughly
representative of the universe from which it was drawn.
The securing of such a sample is a first condition of valid
statistical induction. Practical problems of the first impor-
tance are faced in the actual field work of sampling. The
procedures employed in such field work lie, in the main,
beyond the scope of the present book, but it is desirable
that the general nature of sampling techniques be indicated.
References given at the end of the chapter deal in greater
detail with these procedures.

² For proof, see Appendix C.

The task of securing an adequate sample calls, on the 
negative side, for an avoidance of bias in the individual 
observations and of preventable errors in schedules and 
tabulations. The term bias is applied to observational 
errors that are cumulative and non-compensating. Personal 
prejudices on the part of reporters, mental attitudes of 
which the subjects may be unconscious, or the mere physical 
conditions of observation may lead to persistent errors 
that distort samples. Errors in recording and tabulation 
are easier to detect. Training of enumerators and careful 
editing of schedules and tables will keep such errors to a 
minimum. 

On the positive side sampling technique is directed toward 
the securing of a sample that is truly representative of the 
universe of inquiry. This is a major task, calling for a 
high degree of care and judgment in planning field opera- 
tions conforming to the ultimate objectives of the study. 
A. L. Bowley has classified, under the four heads distin- 
guished below, methods suitable for use in securing a 
representative sample. 

The method of random selection is employed when the 
entire population to be sampled is treated as a whole, and 
members of the sample are so chosen as to be random 
members of that population. In this selection the indi- 
vidual choices must be independent of one another, and 
the chance of any member of the entire population being 
included in the sample must be the same as that of every 
other member. As regards the conditions of selection there 
should be present no element of preference or bias that 
would tend toward the inclusion or exclusion of certain 
members of the larger group. The general requirement 
here laid down should be interpreted, as J. M. Keynes 
has pointed out, to mean that with respect to the purpose
of the particular investigation the members of the sample
should be random members of the population at large. 
Intelligent planning is needed in securing a purely random 
sample. The obvious procedure of picking the most readily 
available cases would by no means meet the condition of 
random selection. Certain important elements in the uni- 
verse of facts to which the conclusions are to be applied 
may be excluded through the play of an unconscious bias 
unless careful attention is given to the selection of cases. 
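
The requirement of equal and independent chances of inclusion can be illustrated with a short sketch. It is offered as an illustration only: the frame of 10,000 numbered wage records is hypothetical, and `simple_random_sample` is a name introduced here. The standard-library routine used selects distinct members with equal probability; the "convenient" selection is shown for contrast and does not meet the condition of random selection.

```python
import random

def simple_random_sample(population, size, seed=None):
    """Select `size` distinct members, each with the same chance of inclusion."""
    rng = random.Random(seed)
    return rng.sample(population, size)

# Hypothetical frame: identification numbers of 10,000 wage records.
frame = list(range(1, 10001))
sample = simple_random_sample(frame, 250, seed=42)

# For contrast: the most readily available cases (here, the first 250
# records in the file). This is NOT a random selection.
convenient = frame[:250]

print(sorted(sample)[:10])
print(convenient[:10])
```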

The population from which a given sample is to be selected 
is often not homogeneous, with reference to the purpose of
a particular investigation. Slum districts and wealthy districts
may both have to be covered, in a study of social or eco-
nomic conditions. Agricultural districts differing mate-
rially in fertility may be included in a farm survey. If,
by a process of stratification, the universe of inquiry 
may be broken into sub-groups individually more homo- 
geneous than the total population, the reliability of sampling 
results may be substantially increased. Within each sub- 
group random selection may be employed. This method 
is termed stratified random selection. The size of each group 
in the sample should be proportionate to the relative 
importance in the total population of the stratum repre- 
sented by that group. Where homogeneous sub-groups are 
secured by the process of stratification, and where the 
differences between the sub-groups are pronounced, this 
method is distinctly superior to that of random selection 
among the undifferentiated members of the population at 
large. 
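
Stratified random selection with proportionate representation may be sketched as follows. The sketch rests on assumptions that are not in the text: the "relative importance" of a stratum is taken to be its share of the total number of cases, the three strata and their sizes are hypothetical, and largest-remainder rounding is one of several possible ways of making the allocations sum to the desired sample size.

```python
import random

def proportional_allocation(stratum_sizes, sample_size):
    """Number of cases to draw from each stratum, in proportion to its size.

    Largest-remainder rounding makes the allocations add up exactly
    to `sample_size`.
    """
    total = sum(stratum_sizes.values())
    exact = {k: sample_size * v / total for k, v in stratum_sizes.items()}
    alloc = {k: int(x) for k, x in exact.items()}
    shortfall = sample_size - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:shortfall]:
        alloc[k] += 1
    return alloc

def stratified_random_sample(strata, sample_size, seed=None):
    """Random selection within each stratum, with proportional allocation."""
    rng = random.Random(seed)
    sizes = {name: len(members) for name, members in strata.items()}
    alloc = proportional_allocation(sizes, sample_size)
    return {name: rng.sample(members, alloc[name])
            for name, members in strata.items()}

# Hypothetical strata: districts grouped by economic condition.
strata = {
    "slum": list(range(0, 3000)),
    "middle": list(range(3000, 9000)),
    "wealthy": list(range(9000, 10000)),
}
chosen = stratified_random_sample(strata, 200, seed=3)
print({name: len(cases) for name, cases in chosen.items()})
```

With these assumed stratum sizes a sample of 200 is divided 60, 120, and 20 among the three groups.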

In using the third method, that of purposive selection, the 
statistician seeks to secure a sample having the same 
characteristics as the universe of inquiry in respect of one 
or more “control” factors. If these controls are highly 
correlated with the quantities that are the objects of investi- 
gation, this method of selection gives obvious support to 
generalizations based on the study of the sample. As in 
stratified selection, sub-groups are employed. These sub-
groups are chosen not at random, but in such a way as to
possess, in the aggregate, the same attributes (e.g., means, 
standard deviations) as the population at large, in respect of 
the control factors. Deliberate manipulation, often through
a process of trial and error, is necessary to effect this agree- 
ment between the sample and the totality. 
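
The trial-and-error character of purposive selection can be suggested by a short sketch. It is a simplified illustration and not the procedure of any investigator cited here: a single hypothetical control factor (average farm size in 40 districts) is used, the districts are given equal weight, and agreement is sought on the mean only, whereas in practice several attributes of the controls (e.g., means and standard deviations) would be matched.

```python
import random
import statistics

def purposive_selection(control_values, k, trials=5000, seed=0):
    """Choose k sub-groups whose mean control value lies closest to that
    of all sub-groups taken together, by repeated trial and error."""
    rng = random.Random(seed)
    names = list(control_values)
    target = statistics.fmean([control_values[n] for n in names])
    best, best_gap = None, float("inf")
    for _ in range(trials):
        trial = rng.sample(names, k)
        trial_mean = statistics.fmean([control_values[n] for n in trial])
        gap = abs(trial_mean - target)
        if gap < best_gap:          # keep the closest agreement found so far
            best, best_gap = trial, gap
    return best, best_gap

# Hypothetical control factor: average farm size (acres) in 40 districts.
data_rng = random.Random(11)
farm_size = {f"district_{i:02d}": data_rng.uniform(40.0, 160.0)
             for i in range(40)}

chosen, gap = purposive_selection(farm_size, k=8)
print(sorted(chosen))
print(f"difference from the mean of all districts: {gap:.4f} acres")
```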

When this method is employed the statistician must, 
of course, have information concerning the “controls” for 
the total population. The application of the method is 
restricted to fields in which such knowledge is available. 
Census type inquiries on population, agriculture, and manu-
factures provide such basic knowledge. Promising work
has been done in purposive selection in dealing with agricul- 
tural data. 

The fourth method, that of stratified purposive selection,
represents a combination of the use of stratification to
secure homogeneous sub-groups and of deliberate selection
through the use of controls. Where data are open to such 
stratification, and where necessary controls are available, 
the combined procedures may profitably be employed. 

When a representative sample has been secured, when 
errors and bias have been avoided, we may still expect the 
attributes of the sample to differ from those of the total 
population. The effects of fluctuations of sampling will 
still be present, so long as the coverage of the sample falls 
short of the universe of inquiry. We may only estimate the 
attributes of the population; we still face the uncertainties 
that inhere in induction. It is possible, however, to define 
with considerable precision the probabilities involved in 
statistical induction when the differences between the 
attributes of the sample and those of the total population 
are due to fluctuations of “simple sampling,” that is, to
the scrambled mass of causes that constitutes chance. 
Under these conditions it is possible to assign in advance 
limits within which we may expect statistical measures 
derived from different samples of the same population to
fluctuate. This means that we may apply to the population 
at large statistical measures secured from the study of a 
sample, not with confidence in their perfect stability, but 
with fairly definite knowledge of the margin of error involved 
in thus extending our results. Where the necessary condi- 
tions are fulfilled statistical induction is a valid procedure. 

Use of Measures of Reliability

Measurements defining the sampling errors to which given 
statistical constants are subject are put to various uses.
It is in order now briefly to review the standard errors of 
different statistical measurements, and to illustrate their 
applications. 

Sampling Errors: The Mean

For the standard error of an arithmetic mean we have

$$\sigma_M = \frac{\sigma}{\sqrt{N}}$$

where the symbol $\sigma$ in the numerator of the right-hand term
refers to the standard deviation of the population from
which the sample is drawn and $N$ is the number of observa-
tions in the sample. Actually, of course, we do not know
the standard deviation of the population, but we use as
an approximation to it the standard deviation of the
sample. The approximation is acceptable except when the
number of observations in the sample is small, in which
case special treatment is needed.¹

Reference has been made above to the fact that a dis- 
tribution of arithmetic means computed from random 
samples of a given population usually follows the normal 
law of error. This is true even though the distribution 
of the population from which the samples are drawn is 
not itself normal. Accordingly, we may interpret given 

¹ See Chapter XVIII.



SAMPLING ERRORS 465 


values of $\sigma_M$ with reference to the probabilities associated
with deviations in a normal distribution. 

Table 34 in Chapter V shows the distribution in 1933 
of 11,404 workers in open hearth steel furnaces, classified 
according to their average hourly earnings. The arithmetic 
mean of this distribution is 50.14 cents; the standard 
deviation, which we may here represent by $s$, is 18.685 cents.
Accepting this standard deviation as an approximation to
the standard deviation of the population from which this
sample was drawn,* we have

$$\sigma_M = \frac{s}{\sqrt{N - 1}} = \frac{18.685}{\sqrt{11{,}403}} = .175$$


The true mean of the hourly earnings of wage workers 
in open hearth furnaces in 1933 is not known. The figure
50.14 cents is our best approximation to it. If we should 
draw many samples, each the size of the one we have here, 
we should have many mean values normally distributed 
and centering, we may assume, at the true value. The 
standard deviation of this normal distribution we estimate 


* The formula for the standard error of the mean, when the $\sigma$ of the popula-
tion is known, is given by $\sigma_M = \sigma / \sqrt{N}$. When the standard deviation of the
population is replaced by that of the sample ($s$), as an approximation to the
desired quantity, the formula for $\sigma_M$ may be written

$$\sigma_M = \frac{s}{\sqrt{N}} \quad \text{or} \quad \sigma_M = \frac{s}{\sqrt{N - 1}}.$$

The first of these is appropriate if $s$ has been derived from the relation
$s = \sqrt{\sum d^2 / (N - 1)}$ (where $d$ is the deviation of a single observation from the
mean); the second is appropriate if $s$ has been derived from the relation
$s = \sqrt{\sum d^2 / N}$. In other words, $N$ should be reduced by 1 either in the derivation
of $s$ or in the derivation of $\sigma_M$. If $\sigma_M$ is derived from the $d$'s of the original data,
the single operation is summed up in Bessel's formula

$$\sigma_M = \sqrt{\frac{\sum d^2}{N(N - 1)}}.$$

(See Whittaker and Robinson, The Calculus of Observations, London, Blackie &
Son, 1924, 205-206.) The reason for the reduction of N is discussed in Chap-
ters XV and XVIII, in dealing with “degrees of freedom.”
to be .175 cents. Knowledge of this standard deviation, or
standard error, enables us to set limits within which it is
highly likely that the true mean lies. Any statements we
make about the true mean are to be interpreted with
reference to this figure.
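
The arithmetic of the open hearth example, and the rule that N is to be reduced by 1 only once, can be verified with a brief sketch. The function names and the `s_divisor` argument are hypothetical conveniences introduced here, and the five observations used for the check are made-up figures.

```python
import math
import statistics

def standard_error_of_mean(s, n, s_divisor="N"):
    """Standard error of a mean from a sample standard deviation s.

    s_divisor tells how s was computed: with divisor N (then divide s by
    sqrt(N - 1)) or with divisor N - 1 (then divide s by sqrt(N)).
    """
    if s_divisor == "N":
        return s / math.sqrt(n - 1)
    if s_divisor == "N-1":
        return s / math.sqrt(n)
    raise ValueError("s_divisor must be 'N' or 'N-1'")

def bessel_standard_error(observations):
    """Bessel's formula: sqrt(sum of squared deviations / (N (N - 1)))."""
    n = len(observations)
    m = sum(observations) / n
    sum_sq = sum((x - m) ** 2 for x in observations)
    return math.sqrt(sum_sq / (n * (n - 1)))

# Open hearth steel workers: s = 18.685 cents, N = 11,404.
print(round(standard_error_of_mean(18.685, 11404), 3))   # 0.175

# Both conventions agree with Bessel's formula (made-up observations).
obs = [3.0, 5.0, 8.0, 9.0, 15.0]
print(bessel_standard_error(obs))
print(standard_error_of_mean(statistics.pstdev(obs), len(obs), "N"))
print(standard_error_of_mean(statistics.stdev(obs), len(obs), "N-1"))
```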

We might, for example, on the basis of these results
make the flat statement: The true mean of the population
lies between 49.965 cents and 50.315 cents. (The first of
these limits is the sample mean minus one standard error;
the second is the sample mean plus one standard error.)
We may not assert that this statement is certainly true.
It may be true or false. But if we continue indefinitely to
draw samples from the population in question, computing
the mean of each and the standard error of that mean, and
if we make a statement about each similar to that made
above, 68 out of 100 such statements will be true. The
different statements will . . .

It is possible to vary the statement according to the
level of probability we wish to work with. Thus we might
say: The true mean of the population lies between 49.80
cents and 50.48 cents.¹ If we should make an indefinitely
large number of such statements, each based on the study
of a sample, we know that 95 out of 100
would be true. This is the kind of knowledge we have
about generalizations based on results obtained from samples.
The essential facts concerning the mean of the present
sample and its reliability may be summarized in the state-
ment: The mean hourly earnings of wage workers in open
hearth furnaces in 1933 was (in cents) 50.14 ± .175.²
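
The limits quoted for the open hearth example, and the proportions of true statements attached to them, may be computed as in the sketch below. It assumes only the figures given in the text (mean 50.14 cents, standard error .175 cents) and the normal law; the function names are introduced here for illustration.

```python
from statistics import NormalDist

mean, standard_error = 50.14, 0.175   # figures from the open hearth example

def limits(mean, standard_error, k):
    """Limits lying k standard errors below and above the sample mean."""
    return mean - k * standard_error, mean + k * standard_error

def probability_within(k):
    """Proportion of a normal distribution lying within k standard
    deviations of its mean."""
    z = NormalDist()
    return z.cdf(k) - z.cdf(-k)

for k in (1.0, 1.96):
    low, high = limits(mean, standard_error, k)
    print(f"k = {k}: {low:.3f} to {high:.3f} cents, "
          f"probability {probability_within(k):.2f}")
```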

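¹ 49.80 = 50.14 - (1.96 × .175)

50.48 = 50.14 + (1.96 × .175)

² The reliability of a mean is sometimes expressed by its probable
error, which is .6745 times the standard error; one half of the area
under a normal curve is included within a range of one probable error
on either side of the mean. In the present example the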
The standard error of a mean is frequently used, not
merely as an abstract measure of sampling reliability, but 
as an instrument for testing a given hypothesis. Such 
an hypothesis usually involves an assumed parent popula- 
tion, and the test centers about the question whether a 
given sample could have been drawn from this parent 
population. Let us assume that, on rational grounds, we 
have set up the hypothesis that the mean duration of 
business cycles is five years. We have observations relating 
to 77 cycles occurring in various countries during stages 
of rapid industrialization.* These cycles are distributed, in
respect of duration, as follows:

Duration of cycles,      Number of
     in years              cycles

         1                    3
         2                   10
         3                   22
         4                   15
         5                   12
         6                    8
         7                    2
         8                    2
         9                    2
        10                    1
                             --
                             77


The mean duration of these 77 cycles is 4.09 years, and 
the standard deviation of the distribution is 1.88 years. 
For the standard error of the mean we have

$$\sigma_M = \frac{1.88}{\sqrt{77 - 1}} = .216$$

Are these results consistent with the hypothesis that our 
sample of 77 cycles is drawn from a parent population 


probable error of the mean is .118 cents. It is well, in any case, to specify
the exact measure of reliability being used.

* Cf. W. C. Mitchell, Business Cycles: The Problem and Its Setting, New
York, National Bureau of Economic Research, 1927, 412-416; F. C. Mills,
“An Hypothesis Concerning the Duration of Business Cycles,” Journal of
the American Statistical Association, December, 1926, Vol. 21, 447-457.
(i.e., a universe of cycles generated under similar conditions)
with a mean duration of five years?

If we use $M$ to represent the mean of the sample data,
$M_h$ to represent the hypothetical mean of the universe,
and $T$ to denote the deviation of our sample mean from
the hypothetical mean, expressed in units of the standard
error of the mean, we may write

$$T = \frac{M - M_h}{\sigma_M} = \frac{4.09 - 5.00}{.216} = -4.21.$$


The figure .216 is, according to our hypothesis, the standard
deviation of a distribution of arithmetic averages the mean
value of which is 5.00. If we were drawing from such a
distribution, the mean of our present sample would represent
a departure of 4.21 standard deviations from the general
mean. What is the probability of such a departure occurring
merely as a result of chance? Consulting a table showing
areas under the normal curve, we find that the area on
one side of the mean, lying at a distance of 4.21 standard
deviations or more from the mean, constitutes 1/100,000
of the total area. In terms of probabilities, this
means that there is only one chance in 100,000 that
an observation drawn from a population represented by the
normal curve will fall below the mean value by 4.21 standard
deviations or more. This chance is so remote that we say
the event in question could not occur. With reference to
the present problem we conclude that the results are not
consistent with the hypothesis. We could not have secured
the sample values in question had we been drawing from
a universe of cycles with a mean duration of five years.
The results fail to confirm the theory we have set up.

The probability just cited (1/100,000) relates to a deviation
below the hypothetical value only. If we wish to
know the probability of an observation departing from the
hypothetical mean value 5.00 by 4.21 standard deviations
or more, without reference to whether the departure be
above or below the hypothetical value, we must double
the above probability. The chance of such a departure in 
one or the other direction is 2/100,000. Tests of hypotheses 
usually take this latter form. It is customary to ask whether 
a deviation of a stated magnitude could occur, and to 
measure the probabilities involved with reference to devia- 
tions in both directions. 
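
The test of the five-year hypothesis can be retraced in a few lines. The sketch below is an illustration, not the author's computing routine: it takes the frequencies from the table of 77 cycles and the standard error of .216 from the text, and it replaces the printed table of normal areas with the normal distribution supplied by the Python standard library. The function names are introduced here.

```python
from statistics import NormalDist

# Duration of cycle in years -> number of cycles (table of 77 cycles).
cycles = {1: 3, 2: 10, 3: 22, 4: 15, 5: 12, 6: 8, 7: 2, 8: 2, 9: 2, 10: 1}
n = sum(cycles.values())
sample_mean = sum(years * freq for years, freq in cycles.items()) / n

def deviation_in_standard_errors(sample_mean, hypothetical_mean, standard_error):
    """Deviation of the sample mean from the hypothetical mean,
    in units of the standard error of the mean."""
    return (sample_mean - hypothetical_mean) / standard_error

def tail_probabilities(t):
    """Probability of a deviation at least as great as |t| in one
    direction, and in either direction (double the first)."""
    one_tail = NormalDist().cdf(-abs(t))
    return one_tail, 2.0 * one_tail

t = deviation_in_standard_errors(round(sample_mean, 2), 5.00, 0.216)
one_tail, two_tail = tail_probabilities(t)

print(f"number of cycles: {n}, mean duration: {sample_mean:.2f} years")
print(f"T = {t:.2f}")
print(f"one direction: {one_tail:.7f}   either direction: {two_tail:.7f}")
```

The one-direction probability comes out at roughly one chance in 100,000, in agreement with the tabular value cited.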

In using tables of the normal probability integral in 
tests of this type we are generally concerned with the 
probability of occurrence of deviations as great as or greater 
than some stated value (in the above example, .91 years,
or 4.21 standard deviations). This probability is repre- 
sented by areas in the two tails of a normal curve (assuming 
that deviations either above or below the mean are in 
question). The inside limits of these segments are set 
by ordinates erected at distances from the mean equal 
to the deviation in question; the outside limits are at 
infinity. (See Fig. 85, in Chapter XIII, for a graphic 
representation of segments lying beyond stated limits.) The 
usual tables of the probability integral define the areas 
falling within limits set by ordinates at specific points. 
Our concern is with areas beyond these ordinates. Sub- 
traction of the internal area from the whole area (unity) 
will, of course, give the area of the external portion defining 
the probability that is here desired.¹

If we should be testing the hypothesis that the mean 
duration of business cycles is four years, we derive the 
value of T as follows: 


$$T = \frac{4.09 - 4.00}{.216} = .42$$



From the tabulated values of areas under the normal curve 


¹ See W. Edwards Deming and Raymond T. Birge, “On the Statistical
Theory of Errors,” Reviews of Modern Physics, Vol. 6, July, 1934, 133ff., for a
discussion of this probability, which they designate $P_u$, and tests based on it.
(In their terminology, u is the difference between the mean of the sample
and the mean of the assumed population.) This article includes a chart (134)
for use in determining the significance of a given deviation.
we determine that approximately 67 per cent of all the
observations in a normal distribution will deviate from
the mean value by .42 standard deviations or more. We
interpret this to mean that if our sample of 77 observations
were drawn from a universe with a mean value of 4.00
years the chances are 67 out of 100 that the mean of the
sample would depart from the population mean by .09 years
or more. (We have counted the combined probabilities
of deviations above and below the population mean.) In
other words, a deviation as great as the one we have experi-
enced might readily occur as a result of chance. The results
are not inconsistent
with the hypothesis that the mean duration of business
cycles is 4.00 years. They do not, be it noted, prove the
hypothesis. All that we may say of statistical evidence,
on the positive side, is that it is not inconsistent with a
given hypothesis. Supporting statistical evidence strengthens
our confidence in the hypothesis, of course. Its tenability
must be judged on the basis of rational considerations
as well as empirical evidence.

“The significance of each test,” say Deming and Birge,¹ 
“depends not only on the value of P (i.e., the measure of 
probability appropriate to the test) that is found, but also 
on how much is known [concerning the] population.” The above 
hypothesis of a four-year cycle has no particular rational 
basis (the figure was used here, of course, to exemplify 
the procedure). The fact that the observed results are not 
inconsistent with it is significant in a negative way, but 
does not establish the truth of the hypothesis. Low values 
of P, which are inconsistent with given hypotheses, 
are highly useful in leading us to reject tentative 
formulations of theory. Acceptable values of P, however, 
must be supplemented by knowledge (a priori and empirical) 
concerning the body of materials being studied and the 
regularities prevailing therein. Within the limit of acceptable 
values, indeed, we may accept one hypothesis, rather than 
another for which empirical tests yield a higher value of 
P, because the former is more consistent with the general 
body of existing knowledge concerning the field in question. 

¹ Loc. cit., 137. 

In the two tests we have applied, no difficulty was encoun- 
tered in interpreting the probabilities bearing on the relation 
between the hypothetical mean and the observed facts. 
In the one case the odds were so small as to leave no doubt 
as to the lack of agreement; in the other case the difference 
was clearly insignificant. But many tests will lie on the 
borderline, and we must have some reasonable criterion 
as to the limit of significance. Odds of 1 out of 100 constitute 
one conventional standard. If a given difference between 
hypothetical and observed values would occur as a result 
of chance only 1 time out of 100, or less frequently, we may 
say that the difference is significant. This means that the 
results are not consistent with the hypothesis we have 
set up. If the discrepancy between theory and observation 
might occur more frequently than 1 time out of 100 solely 
because of the play of chance, we may say that the difference 
is not clearly significant. The results are not inconsistent 
with the hypothesis. The value of T (the difference between 
the hypothetical value and the observed mean, in units 
of the standard error of the mean) corresponding to a 
probability of 1/100 is 2.576. One hundredth part of the 
area under the normal curve lies at a distance from the 
mean, on the x-axis, of 2.576 standard deviations or more. 
Accordingly, tests of significance may be applied with direct 
reference to T, interpreted as a normal deviate (i.e., as a 
deviation from the mean of a normal distribution expressed 
in units of the standard deviation). A value for T of 2.576 
or more indicates a significant difference, while a value of 
less than 2.576 indicates that the results are not inconsistent 
with the hypothesis in question. 
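
The passage from a value of T to the corresponding two-tailed probability, which the text performs by reference to printed tables, may be sketched in Python. This is a minimal illustration and not part of the original procedure: the function names are hypothetical, and the normal probability integral is taken from the complementary error function of the standard library.

```python
import math


def two_tailed_probability(t: float) -> float:
    """Chance that a normal deviate lies |t| or more standard
    deviations from the mean, counting both tails."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def is_significant(t: float, level: float = 0.01) -> bool:
    """Apply a conventional standard (1 out of 100 by default)."""
    return two_tailed_probability(t) <= level


if __name__ == "__main__":
    # The two tests discussed in the text, plus the two conventional limits.
    for t in (0.42, 4.21, 1.96, 2.576):
        print(t, two_tailed_probability(t), is_significant(t))
```

Passing `level=0.05` gives the less stringent standard of 1 out of 20.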

There is, of course, nothing rigid about this particular 
standard. Some statistical workers employ odds of 1 out 
of 20 as a limit, rather than 1 out of 100. With this standard 
we would accept as significant (i.e., not due to chance) a 
difference between hypothetical and observed values that 
would occur only 5 times out of 100, or less frequently, as 
a result of random fluctuations of sampling. The value 
of T corresponding to this standard is 1.96. The standards 
of significance actually employed by a research worker may 
well vary from problem to problem. The investigator uses 
the results of these tests of significance as aids in the 
interpretation of his results and in the development of a body of 
theory that is not inconsistent with the evidence provided 
by experience. In the interplay of deduction and induction 
that marks such a process, no single absolute standard 
for the rejection or acceptance of hypotheses would be 
appropriate. 

The formula for the standard error of a mean, as given 
above, relates to a sample chosen by random selection. 
For a proportionately stratified sample the standard error 
of the mean, $\sigma_s$, may be derived from the relation 

\[ \sigma_s^2 = \sigma_0^2 - \frac{\sigma_a^2}{N} \]

where $\sigma_0$ is the standard error of the same mean as it would 
have been had the N observations been taken at random 
from the universe of inquiry, and $\sigma_a$ is the standard deviation 
of the averages of the several strata about the average 
of the whole sample.¹ In computing $\sigma_a$ the deviation of 
the mean of each stratum is weighted in proportion to the 
number of cases in that stratum. N is the total number 
of observations in the sample. It is clear from the formula 
that the standard error of the mean of the stratified sample 
is smaller than the standard error of a corresponding random 
sample. 

¹ ... Elements of Statistics, London, ... 
SAMPLING ERRORS: MEDIAN AND QUARTILES 

The median is subject to greater sampling fluctuations 
than is the mean. The degree of dispersion of median 
values derived from a number of samples of a stated size 
from a given population will be approximately 25 per cent 
greater than the dispersion of the arithmetic means of 
the same samples. More exactly, we have 

\[ \sigma_{med} = 1.25331 \frac{\sigma}{\sqrt{N}} \]

Estimates of the quartiles, in turn, are less accurate than 
are estimates of the median. For these we have 

\[ \sigma_{Q_1} = \sigma_{Q_3} = 1.36263 \frac{\sigma}{\sqrt{N}} \]
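
As a rough numerical illustration of these relations, the following Python sketch computes the standard errors of the mean, the median, and the quartiles on the assumption of a normal universe. The function names and the figures used (a standard deviation of 10 and a sample of 100 cases) are hypothetical, chosen only to make the ratios visible.

```python
import math

MEDIAN_FACTOR = 1.25331    # approximately sqrt(pi / 2)
QUARTILE_FACTOR = 1.36263


def se_mean(sigma: float, n: int) -> float:
    """Standard error of the arithmetic mean."""
    return sigma / math.sqrt(n)


def se_median(sigma: float, n: int) -> float:
    """Standard error of the median (normal universe assumed)."""
    return MEDIAN_FACTOR * se_mean(sigma, n)


def se_quartile(sigma: float, n: int) -> float:
    """Standard error of either quartile (normal universe assumed)."""
    return QUARTILE_FACTOR * se_mean(sigma, n)


if __name__ == "__main__":
    sigma, n = 10.0, 100   # hypothetical figures
    print(se_mean(sigma, n), se_median(sigma, n), se_quartile(sigma, n))
```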


SAMPLING ERRORS: STANDARD DEVIATION 

In determining the magnitude of the sampling errors to 
which the standard deviation is subject we must distinguish 
between samples drawn from a normally distributed universe 
and those derived in the more general case, in which the 
nature of the distribution of the universe is unknown. If the 
distribution of the universe is normal we have, as the 
estimated standard error of $\sigma$, 

\[ \sigma_\sigma = \frac{s}{\sqrt{2N}} \]

(where N − 1 has been used in the computation of s). Thus, 
for the universe of residential telephone subscribers 
represented by the distribution in Table 109, we have 

\[ \sigma_\sigma = \frac{147.7}{\sqrt{2N}} = 3.31. \]


The more general formula for the standard error of the 
standard deviation involves the fourth as well as the second 
moment of the distribution: 

\[ \sigma_\sigma = \sqrt{\frac{\mu_4 - \mu_2^2}{4 \mu_2 N}} \]

For the distribution of hourly earnings of 11,404 steel 
workers we have 

\[ \sigma_\sigma = \sqrt{\frac{1,384.1183 - (13.9674)^2}{4 \times 13.9674 \times 11,404}} = .0432. \]

Since the moments here employed are in class-interval 
units, the derived measurement is also in those terms. In 
the original units we have 

\[ \sigma_\sigma = .0432 \times 5 \text{ cents} = .2160 \text{ cents}. \]

Many tests of significance involve the use of standard 
deviations and of related measurements of sampling 
reliability. These are discussed more fully in the chapter 
on the analysis of variance. 
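
The two formulas for the standard error of the standard deviation may be compared in a short Python sketch. It is offered under stated assumptions only: the function names are hypothetical, and the moments used (second moment 13.9674, fourth moment 1,384.1183, N = 11,404, class interval 5 cents) are the figures of the worked example as they can best be read from it, so they should be checked against the original tables.

```python
import math


def se_sd_normal(s: float, n: int) -> float:
    """Standard error of the standard deviation, normal universe."""
    return s / math.sqrt(2 * n)


def se_sd_general(mu2: float, mu4: float, n: int) -> float:
    """General form, using the second and fourth moments about the mean."""
    return math.sqrt((mu4 - mu2 ** 2) / (4 * mu2 * n))


if __name__ == "__main__":
    mu2, mu4, n = 13.9674, 1384.1183, 11404   # class-interval units (assumed reading)
    se_units = se_sd_general(mu2, mu4, n)
    print("class-interval units:", se_units)
    print("cents:", se_units * 5)
    # For comparison, the value the normal formula would give for the same data.
    print("normal-universe formula:", se_sd_normal(math.sqrt(mu2), n))
```

When the fourth moment equals three times the square of the second, as in a normal distribution, the general expression reduces to the simpler one.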

¹ ... appropriate to this distribution, the uncorrected moments are used. 

SAMPLING ERRORS: COEFFICIENT OF CORRELATION 

A number of distinctive problems are faced in generalizing 
from the measurements secured in correlation studies. 
Certain of these problems are discussed in the succeeding 
chapters. Chapter XVIII deals with important problems 
that arise when the samples employed are small. 
At this point general methods of measuring the reliability 
of these measurements are presented, without certain 
of the qualifications that will be discussed later. 

As a basic formula for the sampling error of the coefficient 
of correlation, computed from N pairs of observations, we 
have 

\[ \sigma_r = \frac{1 - r^2}{\sqrt{N - 1}} \]

where r, in the right-hand member, is the true coefficient 
of correlation in the population at large. 
Since we do not know the true r we must use the r of the 
sample as an estimate of the required value. This formula 
may be taken to hold for distributions approaching the 
normal type, when the number of cases included in the 
sample is fairly large — say 50 or more. When the sample 
is small and, particularly, when we are dealing with a 
relatively high coefficient of correlation derived from a 
small sample, the standard error secured from the formula 
cited above may be faulty, and tests of significance based 
on it misleading. The reason for this and means of meeting 
the difficulty are discussed in Chapter XVIII. 

In exemplifying the application of the usual test, we 
may employ results presented in Chapter X, on the relation 
between the discount rates of Federal Reserve banks and 
of commercial banks. The value of r is +.84, while N 
equals 1,800. Accordingly, we have 

\[ \sigma_r = \frac{1 - (.84)^2}{\sqrt{1,800 - 1}} = \frac{.2944}{42.41} = .007. \]
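
The computation may be checked with a few lines of Python. The sketch below is illustrative only: the function name is hypothetical, and the formula is taken in the form used in this example, that is, one minus the square of r, divided by the square root of N − 1.

```python
import math


def standard_error_of_r(r: float, n: int) -> float:
    """Large-sample standard error of a coefficient of correlation."""
    return (1.0 - r * r) / math.sqrt(n - 1)


if __name__ == "__main__":
    # Discount rates of Federal Reserve banks and of commercial banks.
    print(standard_error_of_r(0.84, 1800))
```

As the text cautions, the result is trustworthy only for fairly large samples from distributions approaching the normal type.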


The standard error of r is frequently used, as are similar 
measurements relating to other statistical constants, to test 
hypotheses. We may put such a question as the following: 
Is the value of r secured from a given sample significant 
of a real relationship between the variables in question 
in the population from which the sample was drawn? 
Putting the question in form more appropriate for testing: 
Is the present value of r consistent with the hypothesis 
that there is no relationship between the variables in 
question in the population at large? R. A. Fisher terms 
such an hypothesis a “null hypothesis.” The purpose of 
experiment, in his words, is to give the facts a chance of 
disproving the null hypothesis. 

In a study of the movements of commodity prices, 1,202 
measurements were secured on the timing of advances in 
the prices of individual commodities during periods of 
general business revival. Paired with each measurement 
was a similar observation on the timing of the decline in 
the price of the given commodity during the succeeding 
period of general business recession.¹ We desire to know 
whether there is any relation between the sequence of price 
revival and the sequence of price recession. Is there a 
pattern in price movements during business cycles? Evidence 
of the existence of such a persistent pattern would lend 
support to the view that cycles represent true regularities 
in economic life. 

These 1,202 pairs of observations yield a correlation 
coefficient of +.27. This does not show a pronounced 
degree of relationship. Our chief concern, however, is not 
with the magnitude of r. We wish to know whether the 
result is consistent with the hypothesis that the true 
correlation is zero. For the standard error of r we have 

\[ \sigma_r = \frac{1}{\sqrt{1,202 - 1}} = .029. \]


By hypothesis, the population value of r is zero, so the 
numerator of the fraction is 1. 

If the true value of r were zero, and the standard error 
of r were .029, what would the probability be that, as a 
result of chance, we should secure a coefficient of .27 from 
a given sample? Since this value represents a departure 
of more than 9 σ's from the hypothetical value of zero, 
the probability that the difference is due to chance is 
vanishingly small. We conclude that the results are not 
consistent with the hypothesis that the sequence of price 
change during revival is unrelated to the sequence of decline 
in a succeeding recession. The null hypothesis is disproved. 

Had the value of T (in this case $T = r/\sigma_r$) been less 
than 2.576 the conclusion would of course have been 
different. In such a case the discrepancy between the sample r 
and the hypothetical value of zero could be attributed to 
sampling fluctuations. The result would not be inconsistent 
with the null hypothesis. 

¹ The Behavior of Prices, New York, National Bureau of Economic Research, 
1927, 181. 

Having established that the results are not consistent 
with the hypothesis that the true value of r is zero, we may 
compute the standard error of r as actually derived. Assum- 
ing now that the sample is drawn from a parent population 
in which r = +.27, we have 

\[ \sigma_r = \frac{1 - (.27)^2}{\sqrt{1,202 - 1}} = .027. \]
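
The test of the null hypothesis for the price-sequence data may be retraced in Python. This is a sketch under the assumptions of the text (a large sample and a normal sampling distribution of r); the function names are hypothetical.

```python
import math


def t_against_zero_correlation(r: float, n: int) -> float:
    """Deviation of the sample r from a hypothetical value of zero,
    in units of the standard error that holds when the true r is zero."""
    sigma_r_null = 1.0 / math.sqrt(n - 1)   # numerator is 1 - 0**2 = 1
    return r / sigma_r_null


def two_tailed_probability(t: float) -> float:
    """Chance of a normal deviate at least as large as |t|."""
    return math.erfc(abs(t) / math.sqrt(2.0))


if __name__ == "__main__":
    t = t_against_zero_correlation(0.27, 1202)
    print("T =", t)
    print("probability =", two_tailed_probability(t))
    print("significant at 1 out of 100:", abs(t) >= 2.576)
```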


SAMPLING ERRORS: INDEX OF CORRELATION 

The standard error of the index of correlation may be 
approximated from the relation 

\[ \sigma_\rho = \frac{1 - \rho^2}{\sqrt{N - m}} \]

In this formula m represents the number of constants in 
the equation of regression. In the example cited in Chapter 
XII, relating to alfalfa yield and depth of irrigation water, 
ρ is .80, N is 44, and m has a value of 3. We have, thus, 

\[ \sigma_\rho = \frac{1 - (.80)^2}{\sqrt{44 - 3}} = .056. \]

The use and interpretation of this measure are analogous 
to those of $\sigma_r$. In the present instance the index of correlation 
is clearly significant.¹ 
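
A brief Python sketch of this computation follows. The function name is hypothetical; the relation used is the approximation stated above, with N reduced by m, the number of constants in the equation of regression.

```python
import math


def se_index_of_correlation(rho: float, n: int, m: int) -> float:
    """Approximate standard error of the index of correlation."""
    return (1.0 - rho ** 2) / math.sqrt(n - m)


if __name__ == "__main__":
    # Alfalfa yield and depth of irrigation water.
    se = se_index_of_correlation(0.80, 44, 3)
    print("standard error:", se)
    print("index in units of its standard error:", 0.80 / se)
```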


¹ See Ezekiel, M., Methods of Correlation Analysis, N. Y., John Wiley and 
Sons, 1930, 257-258, for a discussion of the sampling reliability of the index of 
correlation. 

SAMPLING ERRORS: THE TEST FOR LINEARITY 

As a test for linearity we have been given 

\[ \zeta = \eta^2 - r^2 \]

But we wish to know whether, in a given case, the difference 
between $\eta^2$ and $r^2$ may be due merely to a chance fluctuation 
of sampling, or to a real departure of the underlying 
relationship from the linear form. As the standard error of $\zeta$ 
Blakeman has proposed 

\[ \sigma_\zeta = 2\sqrt{\frac{\zeta}{N}}\,\sqrt{(1 - \eta^2)^2 - (1 - r^2)^2 + 1} \]

The use of this measure may be illustrated with reference 
to the problem relating to wheat yield which was considered 
in an earlier chapter. For the relation between wheat 
yield and amount of nitrogen used as fertilizer, we had 

r = +.793 
η = .965 
N = 193. 

(The uncorrected value of η should be used here.) 

Therefore 

\[ \zeta = \eta^2 - r^2 = .302. \]

Inserting the given values in the formula for $\sigma_\zeta$ and solving, 
we have 

\[ \sigma_\zeta = .074. \]

With $\zeta$ having a value of .302, about 4.08 times its standard 
error, there can be no question as to the non-linearity of 
the relationship. The difference between $\eta^2$ and $r^2$ is one 
which could hardly be due to chance fluctuations of 
sampling. 
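
The wheat-yield figures may be run through a short Python sketch. It takes Blakeman's standard error in the form $2\sqrt{\zeta/N}\,\sqrt{(1-\eta^2)^2 - (1-r^2)^2 + 1}$, which reproduces the value .074 quoted above; the function name is hypothetical, and the sketch assumes the correlation ratio is at least as large as the absolute value of r.

```python
import math


def blakeman_test(eta: float, r: float, n: int):
    """Return zeta, its standard error, and their ratio."""
    zeta = eta ** 2 - r ** 2
    sigma_zeta = (2.0 * math.sqrt(zeta / n)
                  * math.sqrt((1 - eta ** 2) ** 2 - (1 - r ** 2) ** 2 + 1))
    return zeta, sigma_zeta, zeta / sigma_zeta


if __name__ == "__main__":
    zeta, sigma_zeta, ratio = blakeman_test(eta=0.965, r=0.793, n=193)
    print("zeta =", zeta)
    print("standard error =", sigma_zeta)
    print("ratio =", ratio)
```

The ratio computed from unrounded figures differs slightly from the quotient of the rounded values .302 and .074.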

The criterion $\eta^2 - r^2$ is not very satisfactory as a test 
of linearity, since the distribution of $\zeta$ does not follow the 
normal law. The same weakness attaches to the correlation 
ratio. As Fisher has demonstrated, the distribution of 
$\eta$ does not tend to normality, even with large samples, 
unless the number of arrays is increased without limit. 
Accordingly, the standard error of $\eta$ is of dubious utility. 
More efficient methods of testing for the existence of 
correlation, and for linearity, are discussed in Chapter XV. 
SAMPLING ERRORS: COEFFICIENT OF RANK CORRELATION 

The standard error of the coefficient of rank correlation 
has been given by “Student” as 

\[ \sigma_{\rho_r} = \frac{1}{\sqrt{N - 1}} \]

It is notable that this value is independent of the true 
value of $\rho_r$.¹ This standard error may be taken to relate 
to a normal distribution, and interpreted in the familiar 
manner, when N is fairly large, say 45-50 or more. For 
small samples the distribution of $\rho_r$ is not normal. In the 
example cited in Chapter X, dealing with the relation 
between the number of individual income tax returns and 
the number of passenger automobiles registered in 1934, by 
states, we had $\rho_r = .94$. Since there are 47 observations, 
the value of $\sigma_{\rho_r}$ is given by 

\[ \sigma_{\rho_r} = \frac{1}{\sqrt{47 - 1}} = .147. \]

The sample is large enough to justify the assumption that 
the distribution of $\rho_r$ would approximate the normal type. 
The coefficient of rank correlation is clearly significant, 
being more than six times its standard error. 
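
The arithmetic of this example may be verified in Python. The sketch is illustrative and the function name hypothetical; it uses the relation attributed to “Student,” the reciprocal of the square root of N − 1.

```python
import math


def se_rank_correlation(n: int) -> float:
    """Standard error of the coefficient of rank correlation."""
    return 1.0 / math.sqrt(n - 1)


if __name__ == "__main__":
    # Income tax returns and automobile registrations, 47 observations.
    se = se_rank_correlation(47)
    print("standard error:", se)
    print("coefficient in units of its standard error:", 0.94 / se)
```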

SAMPLING ERRORS: COEFFICIENT OF REGRESSION 

High importance frequently attaches to the coefficient 
of regression, in dealing with relationships among variable 
quantities. For the standard error of this measurement we 
have² 

\[ \sigma_b = \frac{s_y}{\sqrt{\Sigma x^2}} \]

where x is a given value of the independent variable, 
expressed as a deviation from the mean of that variable, 
and $s_y$ is the root mean square of the deviations of the 
actual values of y, the dependent variable, from the 
corresponding computed values. That is, $s_y$ is a measure of the 
scatter about the line of regression.³ 

¹ See Hotelling and Pabst, loc. cit. 

² See R. A. Fisher, Statistical Methods for Research Workers, Edinburgh, 
Oliver and Boyd, sixth edition, 1936, 134-146. 

A test involving the use of $\sigma_b$ may be applied to data 
relating to the average corn yield per acre in Kansas, by 
years, from 1890 to 1933 (see Table 128, Chapter XVI). 
These yields show a fairly consistent declining trend. A 
line of trend fitted to the figures for these 44 years is defined 
by the equation 

\[ Y = 22.05 - .1074X \]

where Y denotes corn yield per acre and X denotes time, 
in years, with origin at 1889. We wish to know whether 
the coefficient of regression (i.e., the slope of the line of 
trend) represents a significant departure from zero. The 
hypothesis we are testing is, then, that the true value of 
the coefficient of regression, in the population from which 
this sample is drawn, is zero; that is, that there has been no 
significant decline in corn yield in Kansas over the period 
in question.⁴ 

For $s_y$ we secure the value 6.70, for $\sqrt{\Sigma x^2}$ the value 
84.23. Accordingly 

\[ \sigma_b = \frac{6.70}{84.23} = .0795. \]

We may denote by the symbol $\beta$ the coefficient of regression 
assumed in our hypothesis (in this case zero). We wish to 
know whether the deviation of our actual b from this 
hypothetical $\beta$ may be attributed to chance, or whether 
it is too great to be so explained. This deviation should 
be expressed in units of the standard error of b, in order 
that the probabilities underlying the normal distribution 
may be applied in our reasoning. Using T, as before, to 
denote the deviation in units of $\sigma$, we have 

\[ T = \frac{b - \beta}{\sigma_b} = \frac{-.1074 - 0}{.0795} = -1.35. \]

³ $s_y = \sqrt{\Sigma (y - y_c)^2 / (N - 2)}$, 
where y denotes a given value of the dependent variable and $y_c$ denotes the 
corresponding value derived from the equation of regression. In the 
computation of $s_y$ for this purpose N must be reduced by the number of constants in 
the equation of regression. 

⁴ The hypothetical population of which we assume our sample to be 
representative is the population that would be generated by the forces responsible 
for variation in Kansas corn yields from 1890 to 1933, if those forces, 
unchanged, were to act upon an infinite number of cases. The application of 
this concept, and of the whole probability calculus, to data ordered in time 
involves some logical difficulties, which are discussed at a later point. 

The given value of b represents a departure of 1.35 
standard deviations from the mean value of zero in our 
hypothetical population. As may readily be determined by 
reference to the table of the probability integral, such a 
deviation might easily occur as a result of chance alone. 
The results then, are not inconsistent with the hypothesis. 
There is no clear evidence here of a significant decline in 
corn yield per acre in Kansas during the period covered. 
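
The Kansas example may be reproduced with a small Python sketch. It assumes that the standard error of b is the scatter about the line divided by the square root of the sum of squared deviations of X from its mean, and that the 44 years are coded 1 to 44, as the origin at 1889 implies. The function names are hypothetical, and the scatter figure 6.70 is taken from the text rather than recomputed.

```python
import math


def slope_standard_error(s_y: float, xs) -> float:
    """Standard error of a coefficient of regression."""
    xs = list(xs)
    mean_x = sum(xs) / len(xs)
    sum_sq_dev = sum((x - mean_x) ** 2 for x in xs)
    return s_y / math.sqrt(sum_sq_dev)


def normal_deviate(b: float, beta: float, sigma_b: float) -> float:
    """Deviation of b from the hypothetical beta in units of sigma_b."""
    return (b - beta) / sigma_b


if __name__ == "__main__":
    years = range(1, 45)                    # 1890 to 1933, origin at 1889
    sigma_b = slope_standard_error(6.70, years)
    t = normal_deviate(-0.1074, 0.0, sigma_b)
    two_tailed = math.erfc(abs(t) / math.sqrt(2.0))
    print("sigma_b =", sigma_b)
    print("T =", t)
    print("two-tailed probability =", two_tailed)
```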

SAMPLING ERRORS: DIFFERENCE BETWEEN MEANS 

A problem of sampling that arises rather frequently is 
that of determining whether two samples could have been 
drawn from the same parent population. Obviously, there 
would be some difference between the means of two samples 
from the same universe, as there would be between standard 
deviations or coefficients of correlation secured from different 
sampling operations. We may illustrate the procedure 
employed in determining the significance of a difference 
between two arithmetic means. 

Reference has been made above to a sample of 77 business 
cycles, occurring during stages of rapid industrialization. 
Their mean duration was 4.09 years; the standard deviation 
of the distribution was 1.88 years. The same investigation 
indicated that the mean duration of 51 business cycles 
occurring in various economies during early stages of 
industrialization was 5.86 years, and that the standard 
deviation of these measurements was 2.41 years. There is 
an indication here that business cycles are accelerated, 
that their average length is shortened, when an economy 
is passing through a phase of rapid industrialization with 
corresponding impetus to technological change. In this 
case the null hypothesis against which we set our facts 
is that there is no difference, in respect of duration, between 
business cycles occurring in the two stages of industrialization 
named. 

The difference between two means is a statistical 
measurement subject to a definite law of distribution. If a 
great many pairs of samples were drawn from a given 
population, the value D (i.e., $M_1 - M_2$) could be computed 
from the two means of each pair. A frequency distribution 
of the D's thus secured would follow the normal law. The 
magnitude of the standard deviation of this distribution 
would be a function of the sizes of the samples thus paired 
and of the standard deviations of these samples. We may 
approximate the standard deviation of this distribution of 
D's from the relation 

\[ \sigma_D = \sqrt{\sigma_{M_1}^2 + \sigma_{M_2}^2} \]

or from 

\[ \sigma_D = \sqrt{\frac{\sigma_1^2}{N_1 - 1} + \frac{\sigma_2^2}{N_2 - 1}} \]

The measurement needed for testing the hypothesis now 
before us is computed from the relation 

\[ \sigma_D = \sqrt{\frac{(1.88)^2}{76} + \frac{(2.41)^2}{50}} = \sqrt{.1627} = .4034. \]

The value of D, the difference between the two means, is 
5.86 − 4.09, or 1.77. This value of D is to be judged 
with reference to a hypothetical value of zero. Accordingly, 
for T (the discrepancy expressed as a normal deviate) we 
have 

\[ T = \frac{1.77 - 0}{.4034} = 4.39. \]

This discrepancy far exceeds the magnitude 2.576, 
corresponding to odds of 1 out of 100. If the true value of D 
were zero, a discrepancy as great as this or greater would 
occur as a result of chance about 1 time out of 100,000 
trials. The results indicate that the difference between 
the two means is not due to chance. The facts are not 
consistent with the hypothesis that the two samples are 
drawn from the same population. There is a significant 
difference between the average durations of business cycles 
occurring in early stages of industrialization and in later 
stages of rapid industrial change. 
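
The steps of this test may be sketched in Python. The sketch assumes, as the worked figures indicate, that each squared standard deviation is divided by one less than the number of cases in its sample; the function names are hypothetical.

```python
import math


def se_of_difference(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Standard error of the difference between two sample means."""
    return math.sqrt(sd1 ** 2 / (n1 - 1) + sd2 ** 2 / (n2 - 1))


def two_tailed_probability(t: float) -> float:
    """Chance of a normal deviate at least as large as |t|."""
    return math.erfc(abs(t) / math.sqrt(2.0))


if __name__ == "__main__":
    # Cycles in stages of rapid industrialization vs. early stages.
    sigma_d = se_of_difference(1.88, 77, 2.41, 51)
    t = (5.86 - 4.09 - 0.0) / sigma_d
    print("sigma_D =", sigma_d)
    print("T =", t)
    print("two-tailed probability =", two_tailed_probability(t))
```

The text's .4034 comes from taking the square root of the rounded sum .1627; the unrounded computation gives a value a trifle smaller, with no effect on the conclusion.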


SAMPLING ERRORS: DIFFERENCE BETWEEN PERCENTAGES 

There are occasions when it is desirable to determine 
whether a difference between two proportions (or 
percentages) is significant. Using $D_p$ to denote such a difference, 
we have, for its standard error, 

\[ \sigma_{D_p} = \sqrt{p_0 q_0 \left(\frac{1}{N_1} + \frac{1}{N_2}\right)} \]

where $p_0$ is the weighted mean proportion, $q_0$ is $1 - p_0$, and 
$N_1$ and $N_2$ are the total numbers of cases in the two samples 
to which the proportions relate.¹ (In computing this value 
and applying the corresponding test it is necessary to divide 
percentages by 100, to reduce them to the form of 
proportions or ratios.) 

¹ See Hornell Hart, “The Reliability of a Percentage,” Journal of the 
American Statistical Association, Vol. 21, March, 1926. 

A tabulation of American and foreign business cycles 
by Wesley C. Mitchell has indicated a relative 
preponderance of three-year cycles in American experience. Of 32 
American cycles 10, or 31.2 per cent, lasted 3 years; of 
134 cycles in other countries 20, or 14.9 per cent, lasted 
3 years. Is the difference between these two percentages 
great enough to justify the inference that the forces acting 
upon American business differ from those acting abroad, 
creating a significantly higher percentage of three-year 
cycles? The hypothesis that we test in this case is that the 
difference is not significant, that the groups of American 
and foreign cycles are drawn from the same universe. 
For the weighted mean proportion we have 

\[ p_0 = \frac{N_1 p_1 + N_2 p_2}{N_1 + N_2} = \frac{(32 \times .312) + (134 \times .149)}{32 + 134} = .1804 \]

\[ q_0 = 1 - p_0 = .8196. \]

Inserting these values in the relationship given above, we have 

\[ \sigma_{D_p}^2 = .1804 \times .8196 \left(\frac{1}{32} + \frac{1}{134}\right) = .005724 \]

\[ \sigma_{D_p} = .0757. \]

The observed difference, .163, is to be compared with a 
hypothetical value of zero. For the discrepancy (expressed 
as a normal deviate) we have 

\[ T = \frac{.163 - 0}{.0757} = 2.15. \]

A discrepancy as great as this or greater might occur, 
as a result of chance, about 3 times out of 100. If our 
standard of significance is 1 out of 100 we must conclude that 
the difference between the two percentages is not clearly 
significant. The result is not inconsistent with the hypothesis 
we set out to test, namely that American and foreign business 
cycles are drawn from the same universe, in respect of 
the proportion of three-year cycles occurring. It is proper 
to say, however, that we are dealing with borderline results. 
If our standard of significance were 1 out of 20 we should 
consider the difference between American and foreign 
experience significant. Perhaps we should say that although 
the present evidence does not provide conclusive proof that 
the two samples come from different universes, there is 
indication of a difference between the forces affecting the 
relative frequency of three-year cycles in the United States 
and in foreign countries. Such results call for further 
research, in order that a more definite conclusion may be 
reached. 
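
The comparison of the two percentages may be retraced in Python. The sketch assumes the standard error of the difference to be the square root of $p_0 q_0 (1/N_1 + 1/N_2)$, which agrees with the figures worked out above; the function names are hypothetical, and the proportions are entered as rounded in the text (.312 and .149).

```python
import math


def proportion_difference_test(p1: float, n1: int, p2: float, n2: int):
    """Return the standard error of the difference and the normal deviate T."""
    p0 = (n1 * p1 + n2 * p2) / (n1 + n2)   # weighted mean proportion
    q0 = 1.0 - p0
    sigma_dp = math.sqrt(p0 * q0 * (1.0 / n1 + 1.0 / n2))
    return sigma_dp, (p1 - p2 - 0.0) / sigma_dp


def two_tailed_probability(t: float) -> float:
    """Chance of a normal deviate at least as large as |t|."""
    return math.erfc(abs(t) / math.sqrt(2.0))


if __name__ == "__main__":
    sigma_dp, t = proportion_difference_test(0.312, 32, 0.149, 134)
    p = two_tailed_probability(t)
    print("sigma =", sigma_dp, "T =", t, "probability =", p)
    print("significant at 1 out of 100:", p <= 0.01)
    print("significant at 1 out of 20:", p <= 0.05)
```

The two printed verdicts differ, which is the borderline situation the text describes.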

SAMPLING ERRORS AND SIGNIFICANT FIGURES 

In deciding upon the number of figures to be recorded 
as significant, measures of sampling errors are, of course, 
pertinent. A useful general rule laid down by Truman L. 
Kelley follows: In a final published constant, retain no figures 
beyond the position of the first significant figure in one third 
of the standard error; keep two more places in all 
computations.¹ Its application may be illustrated with reference to 
the figures on hourly earnings of 11,404 steel workers in 
1933. The mean, to four places, is 50.1360 cents. The 
standard error of the mean is .175 cents. One third of this 
is .0583. The first significant figure is in the column of 
hundredths. By the rule, therefore, the arithmetic mean 
should be given as 50.14 cents. Two more places, or four 
decimal places in all, should be retained in calculations. 
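
Kelley's rule lends itself to a compact statement in Python. What follows is one possible reading of the rule, offered as a sketch: the function names are hypothetical, and the simplification that a standard error whose third exceeds unity calls for no decimal places at all is an assumption of this sketch, not part of the rule as stated.

```python
import math


def kelley_decimal_places(standard_error: float) -> int:
    """Decimal places to publish: the position of the first
    significant figure in one third of the standard error."""
    third = standard_error / 3.0
    position = math.floor(math.log10(third))   # -2 means hundredths
    return max(0, -position)


def published_value(value: float, standard_error: float):
    """Return the rounded constant and the places to keep in computation."""
    places = kelley_decimal_places(standard_error)
    return round(value, places), places + 2


if __name__ == "__main__":
    # Hourly earnings of steel workers: mean 50.1360, standard error .175.
    print(published_value(50.1360, 0.175))
```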

¹ The rule here given is the Kelley suggestion as re-phrased by P. J. Rulon 
(Science, N. S. Vol. 84, No. 2,187, Nov. 27, 1936, 484). I have changed “one 
half the probable error” in Rulon's statement to “one third of the standard 
error.” 
Some Limitations to Measures of Sampling Errors 

The importance of such measures of reliability as have 
been discussed above is, of course, great. With their aid 
we may give precision to our judgments concerning the 
reliability of results, in extending statistical inductions 
beyond the limits of actual observation. Yet limitations 
attach to them, and these must not be forgotten in the 
mechanical application of statistical tests. 

Reference has been made to limitations relating to the 
size of samples. In the interpretation of most measures of 
sampling errors the assumption is made that statistical 
measurements secured from successive samples are 
distributed in accordance with the normal law of error. When 
the number of cases is large this is approximately true even 
though the original data are not so distributed. But with 
a small number of cases in each sample this assumption 
is not valid. The interpretation of a deviation (expressed 
in terms of T) is therefore materially altered when we are 
dealing with results secured from small samples. Techniques 
have been developed, however, for defining sampling 
errors under these conditions. 

The standard errors we have discussed measure only 
errors arising from the fluctuations of simple sampling. 
If there is to be full conformity to the conditions of simple 
sampling, the probability of a given event occurring must 
be the same in all parts of the field sampled and for all time 
periods included, and the events (i.e., drawings 
or observations) must be completely independent of one 
another. The fact that these formulas are 
strictly applicable only when these conditions have been met 
injects an element of uncertainty into statistical inductions 
in the field of economics, for we can seldom be sure that 
the conditions of simple sampling are actually fulfilled. 
They are rarely perfectly fulfilled in the handling of economic 
data. The standard errors derived above can give no 
indication of the possibility of fluctuations in successive 
samples due to causes other than those arising from simple 
sampling. Fluctuations due to bias, due to lack of repre- 
sentativeness in the sample, due to persistent errors of 
any sort, quite elude this method of determining probable 
stability. Although some degree of departure from the 
rigid conditions of perfect sampling does not deprive the 
measures of reliability of all value, the limitations noted 
must be the constant concern of the statistician. 

The element of time adds one serious difficulty to the 
problem of statistical induction in the realm of economics, 
and in the social sciences generally. A universe that extends 
over time is subject to elements of change that are not 
present among data relating to a cross-section of time. 
Conditions of pig iron production, of banking, of foreign 
trade, of income distribution change from year to year, 
even from month to month. We may hardly assume that 
data relating to different time periods reflect the play of 
identical forces. When we deal with data from different 
periods we are, as Oskar Anderson has pointed out, drawing 
from different universes. The structural changes that occur 
in economic organization are manifestations of this state 
of never-ending transition. Accordingly the homogeneity 
of all populations extending over time is suspect. In par- 
ticular are hazards faced when an induction extends to a 
time period not covered by the data of observation. 

The fitting of trend lines, and the use of deviations from 
trend in statistical analysis, represent one effort to overcome 
difficulties arising out of temporal change. It is assumed 
that variations due to trend reflect the deep-seated changes 
that would introduce elements of heterogeneity into the 
particular universe of inquiry, and that deviations from 
trend may be made the bases of statistical inference. The 
effects of some temporal changes are doubtless removed by 
this process. But the argument cannot justify the extension 
to a new time period of measures of sampling error based 
on the study of another period, unless it can be established 
that no essential change occurred in the conditions affecting 
the phenomena in question. The probable errors involved 
in such extension, without the validation noted, are not 
capable of definition. For this extension would involve 
generalizing about one universe from the study of 
another. 

In the application of statistical methods proper choice 
of objectives, wise planning, and effective field work are 
of at least equal importance with skill in the use of statistical 
techniques. This is especially true as regards problems 
of sampling. Here chief emphasis falls on soundness and 
accuracy in the field work. The problems of field work are 
specialized and particular, arising out of specific problems 
and conditions. Appropriate special knowledge is needed 
for the selection and validation of the sample. 

Much may be done to strengthen a statistical induction 
by making actual statistical tests of the homogeneity of 
the population and of the stability of sampling results. 
By testing successive samples the representativeness 
of statistical measures may be determined; and by testing 
the subordinate elements of a given sample, when broken 
up into significant sub-groups, the inherent stability of a 
sample may be checked. The uniformity of nature in a 
given field is assumed in every induction. The induction is 
strengthened by every piece of evidence that supports the 
assumption. 

CHAPTER XV 


THE ANALYSIS OF VARIANCE 

The determination of degree of correlation between variables
involves, essentially, the comparison of measurements
of variability. Thus, in the familiar equation

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2}$$

we are comparing the dispersion about the fitted line of
regression with the dispersion about the mean of the
y's ($\sigma_y$). Again, if we work with the relation

$$r^2 = \frac{\sigma_c^2}{\sigma_y^2}$$

we are comparing the dispersion of the computed values
of y about the mean of the y's ($\sigma_c$) with the dispersion
of the original observations about the mean of the y's ($\sigma_y$).
It is logical thus to compare measurements of variation,
in applying correlation technique, for the purpose of the 
investigator is usually to test an hypothesis concerning the 
forces responsible for variation in the dependent variable. 
He is usually seeking an associated factor which may, on 
some rational basis, be assumed to influence the fluctuations 
of the variable he is treating as dependent. R. A. Fisher 
has developed a procedure to employ in the study of correla- 
tion which is based explicitly upon the analysis of variance. 
We deal in this chapter with certain applications of the 
flexible and powerful instrument Fisher has forged. 

Comparison of Measures of Variability

We deal first with a simple comparison of two groups, 
in respect of variability. The prices of preferred and common
stocks, as quoted on the New York exchanges, may
be compared, to determine whether they differ significantly 
in variability. Table 29, presented on a preceding page, 
showed the distribution of closing prices on July 25, 1936, 
of 66 preferred stocks, paying annual dividends of seven 
per cent. With this we may compare the distribution of a 
like number of common stocks selected at random from 
those for which prices were quoted on the New York Stock 
Exchange on July 25, 1936. The required values are given 
in Table 111. 


Table 111

Comparison of Preferred and Common Stocks in Respect of Price Variation

                     Degrees    Sum of squares   Mean square   Standard    Common        Natural
                     of         of deviations    deviation     deviation   logarithm     logarithm
                     freedom    from mean        (variance)    σ           of standard   of standard
                     (n)                                                   deviation     deviation
                                                                           log10 σ       loge σ

Common stocks        65         99,327.28        1,528.112     39.09       1.59207       3.66590
Preferred stocks
  (seven per cent)   65         30,812.20          474.034     21.77       1.33786       3.08056

                                                                           Difference =  0.58534


Each distribution includes 66 observations. (It is not
essential to this comparison that the number of observations
in the two distributions be equal.) In computing the mean
square deviation we divide the sum of the squared deviations
from the mean by the number of degrees of freedom,
which is here equal to one less than the total number of
observations in each distribution, that is, to N − 1. (More
is said below about the determination of number of degrees
of freedom.) The standard deviation of the common stocks,
39.09, is materially greater than the corresponding figure,
21.77, for preferred stocks, but we cannot tell by inspection
whether the difference is significant, or whether it merely
reflects a fluctuation of sampling. A precise test may be
made by using the coefficient z as a measure of the difference
in variability.

This coefficient is equal to the difference between the
natural logarithms of the two standard deviations. That
is

$$z = \log_e \sigma_1 - \log_e \sigma_2.$$

It is to be noted that natural logarithms are to be employed.
Common logarithms on the base 10 may be shifted readily
to natural logarithms on the base e (2.71828) by using the
factor 2.3026 as a multiplier. From the entries in the
last column of Table 111 we derive .58534 as the value
of z.
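The arithmetic behind Table 111 can be retraced in a few lines. The following is a minimal sketch in Python, not part of the original text; the function and variable names are illustrative assumptions, and only the sums of squares and degrees of freedom are taken from the table.

```python
import math

# Sums of squared deviations from Table 111; each sample of 66 stocks
# has N - 1 = 65 degrees of freedom.
SS_COMMON = 99_327.28
SS_PREFERRED = 30_812.20
DF = 65


def std_dev(sum_sq: float, df: int) -> float:
    """Standard deviation from a sum of squared deviations and its degrees of freedom."""
    return math.sqrt(sum_sq / df)


def z_coefficient(sd_1: float, sd_2: float) -> float:
    """Difference between the natural logarithms of two standard deviations."""
    return math.log(sd_1) - math.log(sd_2)


sd_common = std_dev(SS_COMMON, DF)
sd_preferred = std_dev(SS_PREFERRED, DF)
z = z_coefficient(sd_common, sd_preferred)

# Shifting a common logarithm to a natural logarithm with the factor 2.3026.
ln_from_log10 = 2.3026 * math.log10(sd_common)

print(round(sd_common, 2), round(sd_preferred, 2), round(z, 4))
```

Because the sketch carries unrounded standard deviations, its value of z agrees with the tabulated .58534 only to about three decimal places.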

If common and preferred stocks were alike, with respect 
to the dispersion of their prices, and if we had sufficiently 
large samples so that sampling fluctuations did not affect 
the measures of variance, the value of z would be zero. 
Is the value we have derived consistent with the hypothesis 
that the true value of z is zero? Could sampling fluctuations 
alone account for a deviation as great as .58534 from a 
true value of zero? If the derived value of z is too great 
to be attributed to sampling fluctuations, the hypothesis 
that common and preferred stocks are alike, with respect 
to the dispersion of their prices, is untenable. 

To determine whether the derived value of z is consistent 
with the hypothesis that its true value is zero, we must 
know something about the distribution of values of z, if 
these were computed from many samples drawn under 
the same conditions. Fisher has shown that this distribution 
is normal, or effectively so, when the two distributions 
being compared both include a large number of observations. 
This is also true when the two distributions include only 
a moderate number of observations, but with $n_1$ and $n_2$
equal or nearly equal. The standard deviation of a distribution
of z's secured under these conditions, or the
standard error of z, is a function of the two n's. It may be
derived from the relationship

$$\sigma_z = \sqrt{\frac{1}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

where $n_1$ and $n_2$ are the number of degrees of freedom in
the two distributions.

In the present example $n_1$ and $n_2$ are both equal to 65:
the standard error of z is equal to the square root of the
reciprocal of 65. We have

$$\sigma_z = \sqrt{.01538} = .124.$$

The test of the hypothesis that the true value of z is zero
reduces, then, to the question whether a value of .58534
is likely to be drawn from a normally distributed population
with a mean value of zero and a standard deviation of
.124. A value of .58534 represents a deviation of 4.72
standard deviations from zero (i.e., $z/\sigma_z = 4.72$). A deviation
as great as this occurs so seldom, in random sampling,
that we may not accept the conclusion that the present 
value represents a chance deviation from zero. The result 
is not consistent with the hypothesis that the true value 
of z is zero. The dispersion of common stock prices is 
significantly greater than the dispersion of the prices of 
preferred stocks paying seven per cent dividends. 
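The test just described may be sketched as follows. This is an illustrative Python fragment under stated assumptions: the helper names are hypothetical, the standard error is the square root of half the sum of the reciprocals of the two numbers of degrees of freedom, and the two-sided normal tail area is an added convenience that the text itself does not compute.

```python
import math


def z_standard_error(n1: int, n2: int) -> float:
    """Standard error of z: square root of half the sum of the two reciprocals."""
    return math.sqrt(0.5 * (1.0 / n1 + 1.0 / n2))


def two_sided_normal_tail(ratio: float) -> float:
    """Probability that a standard normal variate deviates from zero
    by at least `ratio` standard deviations, in either direction."""
    return math.erfc(abs(ratio) / math.sqrt(2.0))


z = 0.58534                      # from Table 111
se = z_standard_error(65, 65)    # both samples have 65 degrees of freedom
ratio = z / se                   # deviation from zero in standard-error units
p = two_sided_normal_tail(ratio)

print(round(se, 3), round(ratio, 2), p)
```

The same function accepts unequal degrees of freedom, so it applies equally when the two samples differ somewhat in size.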

To exemplify a different condition, we may compare the 
dispersion of prices of preferred stocks paying six per cent 
and of preferred stocks paying seven per cent dividends. 
We have 64 quotations on the former, 66 quotations on 
the latter, both relating to closing prices on the New York
Stock and Curb Exchanges on July 25, 1936. The figures 
are given in Table 112 on page 494. 

Table 112

Comparison of Six Per Cent and Seven Per Cent Preferred Stock
in Respect of Price Variation

                     Degrees    Sum of squares   Mean square   Standard    Natural
                     of         of deviations    deviation     deviation   logarithm
                     freedom    from mean        (variance)    σ           of standard     1/n
                     (n)                                                   deviation
                                                                           loge σ

Seven per cent
  preferred stocks   65         30,812.2         474.034       21.77       3.08056         .0153846
Six per cent
  preferred stocks   63         28,175.0         447.222       21.15       3.05166         .0158730

                                                              Difference = 0.02890   Sum = .0312576

In this comparison the value of z is .02890. The standard
error of z (the square root of half the sum of the two reciprocals)
is .12502. The coefficient z deviates from zero by
an amount equal to about one fourth of the standard error
of z ($z/\sigma_z = .23$). This, of course, is a deviation that would
occur very frequently in a normally distributed variate
with mean value of zero. The result is, therefore, consistent
with the hypothesis that the true value of z is zero. There
is no significant difference between six per cent and seven
per cent preferred stocks in respect of the dispersion of
their quoted prices.

The Testing of Variability between Classes

The comparison of standard deviations provides a means 
of answering questions of another type. Measurements of 
changes in the average selling prices of products of manu- 
facturing industries may be used to exemplify the procedure. 
If we classify manufacturing industries into those producing 
perishable, semi-durable, and durable goods, and compute 
an average of changes occurring between 1929 and 1933 
in the selling prices of the products of each of these categories, 
we obtain the index numbers given in Table 113. 

Table 113

Measurements of Average Changes in Selling Prices, 1929-1933, in
Three Groups of Manufacturing Industries

                                 No. of        Index of selling prices
Class of industry                industries    1929        1933

Producing perishable goods       34            100         69.81
Producing semi-durable goods     26            100         66.41
Producing durable goods          25            100         78.96
All industries                   85            100         71.46

The average decline in prices was much less among durable
manufactured goods than among goods of the other classes;
semi-durable goods suffered the greatest loss. The range
of variation among the three averages is considerable, but
on the basis of the evidence here given we are not able to
say whether the observed differences are due to chance,
merely, or whether the prices of these several classes of
goods were subject to the play of quite different forces,
during the period here covered. An objective test is needed,
before we may assume that the observed differences are
significant.

For the application of such a test we need a measure of
variation which is independent of the principle of classification
here employed. How much might a series of price
relatives for 1933, on the 1929 base, be expected to vary 
as a result of the play of chance? (By “chance” we here 
mean the mass of causes unrelated to the factor of relative 
durability.) A measure of the strength of such causes is 
provided by the variation within the three classes we have 
set up. The method used in measuring the variation within 
these classes is indicated in Table 114 on page 496. 

Table 114

Illustrating the Measurement of Variation within Classes

(1)                             (2)           (3)                (4)
Class of industry               No. of        Mean of price      Sum of squares of
                                industries    relatives          deviations of individual
                                              (1933 on 1929      price relatives from
                                              base)              class mean

Producing perishable goods      34            69.81               6,464.0275
Producing semi-durable goods    26            66.41               3,375.1849
Producing durable goods         25            78.96               5,725.6916
All industries                  85                               15,564.9040

It will be understood that the deviations which, in
squared form, enter into the sums in the last column are
the differences between individual items and the means
of the classes in which those items fall. Thus the relative
measuring the average selling price of products of the meat
packing industry in 1933 was 44.90 on the 1929 base.
This industry falls in the perishable goods group. The
difference between 44.90 and 69.81 is 24.91. The square
of this, or 620.5081, is one of the 34 items making up the
figure 6,464.0275, the entry for perishable goods in the
last column of the preceding table. The sum of the entries
in this last column, 15,564.9040, represents variation in
price changes within the three classes. It is not influenced
by factors of perishability or durability, since the total is
affected only by variation among perishable goods, variation
among semi-durable goods, and variation among durable
goods.

Eighty-five items enter into this total. However, only 
82 degrees of freedom are present. The 34 perishable 
goods possess 33 degrees of freedom to vary, the 26 semi- 
durable goods possess 25 degrees of freedom, and the 25 
durable goods possess 24 degrees of freedom. For the 
standard deviation defining variability within classes we
have, therefore

$$\sigma = \sqrt{\frac{15{,}564.9040}{82}} = 13.78.$$

This figure provides us with a yardstick, a measure of the 
degree of variation that is independent of the principle 
of classification employed in distinguishing perishable, semi- 
durable, and durable goods. This measures the variation 
due to the mass of floating causes known as chance. 
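The pooling of the three within-class sums may be sketched in Python. This is an illustration only: the names are hypothetical, and the figures are the class sizes and sums of squares of Table 114.

```python
import math

# (class, number of industries, within-class sum of squared deviations)
WITHIN = [
    ("perishable", 34, 6464.0275),
    ("semi-durable", 26, 3375.1849),
    ("durable", 25, 5725.6916),
]


def pooled_within(classes):
    """Total within-class sum of squares, its degrees of freedom,
    and the standard deviation measuring variation within classes."""
    total_ss = sum(ss for _, _, ss in classes)
    # Each class loses one degree of freedom to its own mean.
    df = sum(n - 1 for _, n, _ in classes)
    return total_ss, df, math.sqrt(total_ss / df)


within_ss, within_df, within_sd = pooled_within(WITHIN)
print(round(within_ss, 4), within_df, round(within_sd, 2))
```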
With this standard we may compare the differences
between the three class averages presented in Table 113.
The magnitude of these differences may be defined by
a single measurement, a standard deviation. In its computation
the deviation of each class mean from the grand mean
is measured, and the square of this deviation is multiplied
by the number of items in the class in question. The procedure
is illustrated in Table 115.


Table 115

Illustrating the Measurement of Variation between Classes

(1)                    (2)          (3)          (4)                (5)                (6)
Class of industry      No. of       Mean of      Deviation of       Square of          Weighted
                       industries   price        class mean from    deviation of       squared
                                    relatives    mean of all        class mean from    deviation
                                    (1929 =      observations       mean of all
                                    100)                            observations
                                                                    (4)²               (2) × (5)

Producing perishable
  goods                34           69.81        − 1.65              2.7225               92.5650
Producing semi-
  durable goods        26           66.41        − 5.05             25.5025              663.0650
Producing durable
  goods                25           78.96        + 7.50             56.2500            1,406.2500

                                                                                       2,161.8800


The sum of the entries in column (6), 2,161.8800, measures
the total variation between classes. Although weights are
used in getting this total, the differences relate to three
separate averages, only, and but two degrees of freedom
are represented in the total. As a measure of the degree
of variation between the three broad categories we have
set up, we have

$$\sigma = \sqrt{\frac{2{,}161.8800}{2}} = 32.88.$$

This we may take as a measure of the difference in degree
of change in selling price from 1929 to 1933 that appears 
to be related to the relative durability of products. 
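The computation of Table 115 can be retraced in the same manner. The sketch below is illustrative; the names are assumptions, and the class means, class sizes and grand mean (71.46) are those of Table 113.

```python
import math

# (class, number of industries, mean price relative for 1933 on the 1929 base)
CLASS_MEANS = [
    ("perishable", 34, 69.81),
    ("semi-durable", 26, 66.41),
    ("durable", 25, 78.96),
]
GRAND_MEAN = 71.46  # mean of all 85 industries


def between_classes(classes, grand_mean):
    """Weighted sum of squared deviations of class means from the grand mean,
    its degrees of freedom, and the corresponding standard deviation."""
    ss = sum(n * (mean - grand_mean) ** 2 for _, n, mean in classes)
    df = len(classes) - 1  # three averages leave two degrees of freedom
    return ss, df, math.sqrt(ss / df)


between_ss, between_df, between_sd = between_classes(CLASS_MEANS, GRAND_MEAN)
print(round(between_ss, 4), between_df, round(between_sd, 2))
```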

The next step involves the formal testing of this figure 
against the standard provided by the measures of variation 
within classes. This test is applied in Table 116. Certain 
necessary calculations are also indicated. 


Table 116

Comparison of Measures of Variation

Nature of          Degrees    Sum of          Mean          Standard
variability        of         squared         square        deviation    Log10 σ    Loge σ
                   freedom    deviations¹     deviation     σ
                   n

Between classes    2           2,161.8800     1,080.9400    32.88        1.51693    3.49288
Within classes     82         15,564.9040       189.8159    13.78        1.13925    2.62324
Total                         17,726.7840

                                                            Difference = z =        0.86964


¹ The figure 17,726.7840, which is the sum of the squared deviations of
individual observations from their respective class means and of the squared
deviations of the several class means from the mean of all the observations, is
equal to the sum of the squared deviations of all the individual observations
from the mean of all the observations. In the table the total has been broken
into two components, representing variability between classes and variability
within classes.

The test reduces, it is clear, to a comparison of two
measures of variability. One, the standard of comparison,
is the measure of variance within classes, a measure completely
independent of the perishability or durability of
the product. The other is a measure of variation between
classes. Such variation might be due to the same general
mass of causes responsible for variation within classes, or
it might be due to special forces related to differences in
the durability of the goods in question. If the former
explanation is correct, the two measures of variation should
be of the same order of magnitude, with due allowance
for sampling fluctuations. If the second is the correct explanation
the two measures of variation may differ appreciably
in magnitude. The test, therefore, reduces to the
question: Is the variation between classes significantly different
from the variation within classes, account being
taken of the degrees of freedom present in the two cases?
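The breaking of the total into two components, described in the footnote to Table 116, may be stated compactly. The notation here is introduced only for illustration and is not the author's: $x_{ij}$ is the j-th observation in class i, $\bar{x}_i$ the mean of class i, $n_i$ the number of items in that class, and $\bar{x}$ the mean of all the observations.

```latex
\sum_i \sum_j (x_{ij} - \bar{x})^2
  = \sum_i \sum_j (x_{ij} - \bar{x}_i)^2
  + \sum_i n_i (\bar{x}_i - \bar{x})^2
% Total = within classes + between classes:
% 17,726.7840 = 15,564.9040 + 2,161.8800
```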

This question could be answered with reference to the
standard error of z, provided the distribution of z be normal,
or approximately normal. This is the case when the n's
that measure the number of degrees of freedom are both
large or when, though of moderate value, they are equal
or nearly equal. This condition prevailed in the examples
cited earlier. It is not met in the present instance, so we
may not with accuracy employ the method of estimating
and utilizing the standard error of z that was used in the
earlier case. When the numbers of degrees of freedom are
unequal and relatively small, as in this case, tests of significance
may be most readily made with reference to a tabulation
of values of z, prepared by R. A. Fisher. This tabulation
gives, for various values of $n_1$ and $n_2$, values of z that would
be exceeded 5 times out of 100, as a result of chance, if
the true value of z were zero; it also gives one per cent
values of z, i.e., values of z that would be exceeded 1 time
out of 100, under conditions of random sampling, if the
true value of z were zero. These two sets of values are
reproduced in Appendix Tables VI and VII of this book,
through the courtesy of Dr. Fisher and Oliver and Boyd, of
Edinburgh, his publishers.*

* Uses of the function z are discussed in R. A. Fisher's book, Statistical
Methods for Research Workers, Edinburgh, Oliver and Boyd, sixth ed.
A table similar to Fisher's z-table, but relating to an alternative measure F,
has been constructed by George W. Snedecor. F is derived directly from the
variances (i.e., the values of $\sigma^2$) that are being compared; it is the ratio of the
larger of the two variances to the smaller. For a table of values of F and a
discussion of its uses see George W. Snedecor, Statistical Methods, Ames,
Iowa, Collegiate Press, 1937.

In the present example the value of z, defining the degree
of difference between the two measures of variation in
Table 116, is .8696. In entering the z-table for the purpose
of testing this measurement, the number of degrees of
freedom corresponding to the larger of the two measures
of variation compared is taken as $n_1$; $n_2$ is the number
of degrees of freedom corresponding to the smaller of the
two measures. This is a necessary procedure, with reference
to the table as constructed. In the problem that now concerns
us $n_1 = 2$, $n_2 = 82$.

For $n_1 = 2$ and $n_2 = 60$, the 1 per cent value of z is
.8025; for $n_1 = 2$ and $n_2 = \infty$, the 1 per cent value of z
is .7636. Interpolating, we obtain .7920 as the 1 per cent
value of z for $n_1 = 2$, $n_2 = 82$.¹ If the true value of z were
zero, we should expect a value as great as .7920, or greater,
to occur as a result of chance only 1 time out of 100. The
present value of z materially exceeds .7920; the probability
of a value as great as this occurring as a result of chance,
if the true value of z were zero, is less than 1 out of 100.
The results of the test are not, therefore, consistent with
the hypothesis that the true value of z is zero. The differences
between the three class averages shown in Table 113
are too great to be attributed to chance. We may conclude
that the price movements of perishable, semi-durable, and
durable manufactured goods between 1929 and 1933 were
significantly different.

¹ Interpolation in the z-table is based upon direct proportions among the
reciprocals of the n's. In the above case

    for $n_1 = 2$, $n_2 = 60$:       the 1 per cent value of z = .8025     $1/n_2 = 1/60 = .0167$
    for $n_1 = 2$, $n_2 = \infty$:   the 1 per cent value of z = .7636     $1/n_2 = 1/\infty = .0000$
                                                          Δ = .0389                        Δ = .0167

We must find the 1 per cent value of z corresponding to $n_1 = 2$, $n_2 = 82$.
For $1/n_2$ we have

    1/82 = .0122.

The difference between 1/82 and 1/∞, for which we must interpolate between
the given values of z, is .0122 − .0000 = .0122. The required 1 per cent value
of z is

$$z = .7636 + \left(\frac{.0122}{.0167} \times .0389\right) = .7920.$$

The process of interpolation on the $n_1$ scale, if required, would be similar.
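The interpolation on reciprocals may be sketched as a small Python function. The function name and argument order are assumptions made for illustration; the two tabulated 1 per cent values (.8025 and .7636) and the natural logarithms of the standard deviations are those quoted in the text and in Table 116.

```python
import math


def interpolate_on_reciprocals(n, n_a, z_a, n_b, z_b):
    """Interpolate a tabulated z between degrees of freedom n_a and n_b,
    proportioning on the reciprocals 1/n rather than on n itself.
    n_b may be math.inf, whose reciprocal is zero."""
    r, r_a, r_b = 1.0 / n, 1.0 / n_a, 1.0 / n_b
    return z_b + (r - r_b) / (r_a - r_b) * (z_a - z_b)


# 1 per cent points for n1 = 2 at n2 = 60 and at n2 = infinity.
critical = interpolate_on_reciprocals(82, 60, 0.8025, math.inf, 0.7636)

# Observed z: difference between the natural logarithms in Table 116.
observed = 3.49288 - 2.62324

print(round(critical, 4), round(observed, 5), observed > critical)
```

Working with unrounded reciprocals, the sketch gives a critical value that agrees with the .7920 of the footnote to about four decimal places.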
Variance Analysis in the Measurement of
Relationship

The procedure employed in the comparison of measures
of variability is applicable to the measurement of correlation.
Indeed, using this technique it is possible to employ a
systematic procedure that is of great value in revealing
the character and degree of the relationships prevailing
between variable quantities. This procedure is illustrated
in the next section.

The method employed in applying to a typical correlation
problem the method of analysis based on comparison of
variances may be illustrated with reference to the data of
alfalfa yield previously studied. These are presented in
Table 117.

Table 117

Summary of Results Secured in Experiments with Alfalfa

(The measurements in the body of the table measure yields, in tons per acre,
in 44 experiments)

                          Inches of irrigation water applied

            0       12      18      24            30      36      48      60

            2.35    4.31    5.69    6.00          7.53    7.58    8.05    5.55
            2.75    4.78    6.46    6.89          7.97    8.22    8.45    7.25
            2.89    4.84    7.02    7.96          8.32    8.63    8.63   10.17
            3.85    5.83    8.02    8.32          9.43    9.33    8.83   10.70
            5.52    6.51            8.38          9.54    9.38    9.52
            5.94    7.82            [illegible]  11.06   12.48   10.62

Average
yield       3.88    5.68    6.80    7.92          8.98    9.27    9.01    8.42      7.48

The average yield of alfalfa, in these 44 experiments,
was 7.48 tons per acre. But there was rather wide variation
among the results. The sum of the squares of the deviations
of the 44 observations from the mean is 228.33. This
sum sets our problem. We should like to find reasons
for this variation.

TESTING FOR THE EXISTENCE OF CORRELATION

The observations are set up above in a form suited to 
the testing of one hypothesis concerning the factors affect- 
ing alfalfa yield. The data are arranged in eight arrays, 
classified according to the depth of irrigation water applied. 
This depth varied from 0 to 60 inches. Variations in yield 
appear to be associated with variations in amount of water 
applied. As a basis for our procedure we set up the hypothe- 
sis that there is no such association. To test this hypothesis, 
we may break the sum that measures the total variation 
of yields into two parts measuring, respectively, the variation 
within arrays and the variation between arrays. 

To determine the total variation within arrays, the devia- 
tion of each observation from the mean of the array in 
which it falls is measured. The sum of the squares of these 
deviations, for all the arrays, is the desired total. Thus, 
in the first array of Table 117, the mean is 3.88 tons. 
The deviation of the first observation, 2.35, from this 
figure, is −1.53; its square is 2.3409. The deviation of
the second observation, 2.75, is −1.13; its square is
1.2769. Determining in similar fashion the deviations of 
the four other observations in that array from the mean 
of the array, squaring these, and adding the six squared 
values, we have 11.5320 as the sum of the squares of the 
deviations in the first array. Performing similar calculations 
for the seven other arrays, and adding the eight sums thus 
secured, we have a figure of 76.39. This is the total varia- 
tion within arrays. For convenience we may refer to this 
as component A of the total variation. 

In determining the total variation between arrays, the 
deviations of the means of the various arrays from the 
mean of all the observations are measured and squared, 
and the weighted sum of these squares is secured. Weights 
are based upon the number of observations in the several 
arrays. Thus the mean of the first array, 3.88, deviates
from the mean of all the observations, 7.48, by 3.60;
the square of this is 12.9600. Multiplying by six (the 
number of observations in the first class), we have 77.7600. 
Securing similar weighted figures for the seven other arrays, 
and adding, we have 151.94 as the total variation between 
arrays. This we may call component B. 
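The two calculations just described may be sketched for the first array. The Python fragment is illustrative only: the names are assumptions, the six yields are those of the first column of Table 117, and the rounded means 3.88 and 7.48 are used as in the text.

```python
# Yields, in tons per acre, on the six plots receiving no irrigation water.
first_array = [2.35, 2.75, 2.89, 3.85, 5.52, 5.94]
ARRAY_MEAN = 3.88   # mean of the first array, as rounded in the text
GRAND_MEAN = 7.48   # mean of all 44 observations

# Contribution of this array to component A (variation within arrays).
within_ss = sum((y - ARRAY_MEAN) ** 2 for y in first_array)

# Contribution of this array to component B (variation between arrays):
# squared deviation of the array mean, weighted by the number of observations.
between_term = len(first_array) * (ARRAY_MEAN - GRAND_MEAN) ** 2

print(round(within_ss, 4), round(between_term, 4))
```

Repeating both steps for the seven other arrays and adding gives the two components; as the text notes, full agreement with the independently computed total requires carrying more decimal places than are shown here.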

In breaking up the total variation into two components¹
we have distinguished variations in yield that are definitely 
not related to differences in depth of irrigation water applied, 
from variations in yield that may or may not be related 
to irrigation differences. Within the first array, including 
six experiments on plots to which no irrigation water was 
applied, yields varied from 2.35 tons to 5.94 tons per acre. 
The total variation within this array (the sum of the squares 
of the deviations from the mean of the array) amounted 
to 11.5320. Since the irrigation factor was constant, this
sum measures variation which is completely independent 
of changes in irrigation. This is true also of the figure 
76.39, measuring total variation within all the eight arrays 
set up in Table 117. Differences in soils and innumerable 
minor factors combined to create variation within these 
arrays. The figure 76.39 measures the play of that host 
of undefined forces to which we give the name chance.
The one specific factor which does not affect this figure is 
irrigation. We have measured the variation in such a 
way that irrigational differences do not enter. 

¹ The sum of the two components is, of course, equal to the total:

    Variation within arrays (Component A)      76.39
    Variation between arrays (Component B)    151.94
    Total variation                           228.33

For a demonstration of this relationship see note, pp. 418-9.

To ensure full consistency between components A and B and the total
(and among the sub-divisions of B later defined), when these quantities are
independently computed, it is necessary that all computations be carried to
more decimal places than are customarily retained.

Irrigational differences do enter definitely into the variation
between arrays. Indeed, it may be the dominant
factor in this variation, which is measured by the figure
151.94. But of this we cannot be sure. For the means of
the eight arrays differ among themselves not only because
of differences in the amounts of irrigation water applied to
the different plots. To yield differences due to the irrigation
factor are added yield differences due to the innumerable
other forces that influence alfalfa yield, the forces we lump
together as chance. For chance factors affect the means
of the various arrays, and so affect the variation between
arrays, just as they affect the variation within arrays. As
the experiment was designed, the influence of irrigational
differences is present only in the variation between arrays,
but the influence of "chance" is present in both the variation
within arrays and the variation between arrays.

In this fact is found the key to our problem, and the 
instrument for testing our hypothesis. For, in so far as 
chance alone is operative, the variation between arrays 
would be expected to be of the same order of magnitude 
as the variation within arrays. The figures we have so 
far examined indicate that the variation between arrays 
is greater than the variation within arrays. But this may 
be a purely fortuitous result. The apparent increase of 
yield with increased irrigation may be entirely a chance 
phenomenon, similar to a run of heads in tossing a coin. 
This we must test. We must determine whether the forces
responsible for variation between arrays are the same as 
the forces responsible for variation within arrays. 

The hypothesis we shall test, and which may of course be 
disproved, is that the forces responsible for variation between 
arrays are the same as the forces responsible for variation 
within arrays; in other words, that there is no association 
between depth of irrigation water applied and alfalfa yield. 
The nature of the test to be applied has been indicated 
in the preceding sections. We shall compare two measures 
of variation, to determine whether they are of the same 
order of magnitude. But before this test is applied, account 
must be taken of the number of degrees of freedom prevailing
in each case. This concept calls for brief explanation.

If the data of alfalfa yield related to but one plot of
land, in one year, there would be no variation. A single
observation would coincide with the mean, and the standard
deviation would be zero. With a second observation opportunity
for variation arises. But we may think of it as a
single opportunity. With but two observations there is
but one degree of freedom to vary. With three observations,
two opportunities to vary are given; there are two degrees
of freedom. In problems of this sort the number of degrees
of freedom is equal to N − 1. Our present example includes
44 observations; hence the total variation 228.33 represents
the resultant of 43 degrees of freedom.

How are these 43 degrees divided between the two
components, A and B? As regards variation within arrays,
this may be readily determined by reference to Table 117.
Variation within arrays, it will be recalled, was measured
with reference to the means of the various arrays. In the
first array, containing six observations, there exist five
degrees of freedom to vary from the mean of that array.
The same is true of the arrays relating to 12, 24, 30, 36, and
48 inches of irrigation water. In each of the arrays relating
to 18 and 60 inches of water there are but four observations,
with three degrees of freedom. The total of these degrees
of freedom is 36. Variation between arrays was determined
by measuring the deviations of the means of eight arrays
from the general mean of the distribution. Since eight
different values are involved, there are seven degrees of
freedom. (The fact that weights were employed in securing
the total variation between arrays does not affect the determination
of degrees of freedom.) The 36 and the 7, combined,
use up all the 43 degrees of freedom entering into the total
variation.
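The division of the degrees of freedom may be checked by a simple count. The sketch below is illustrative; the dictionary of array sizes follows the description in the text (six observations in each array except those for 18 and 60 inches, which have four), and the variable names are assumptions.

```python
# Number of observations in each array, keyed by inches of irrigation water.
array_sizes = {0: 6, 12: 6, 18: 4, 24: 6, 30: 6, 36: 6, 48: 6, 60: 4}

total_observations = sum(array_sizes.values())

# Total variation: one degree of freedom lost to the general mean.
total_df = total_observations - 1

# Within arrays: each array loses one degree of freedom to its own mean.
within_df = sum(n - 1 for n in array_sizes.values())

# Between arrays: eight array means vary about the general mean.
between_df = len(array_sizes) - 1

print(total_observations, total_df, within_df, between_df)
```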

Knowing these degrees of freedom we may now reduce
the measures of variation within arrays and of variation
between arrays to comparable terms, and determine the
significance of the difference between them. This is done
in Table 118. This table and others following differ somewhat
from those employed in similar comparisons in the opening
sections of this chapter. In the earlier tables variability
was measured in units of the standard deviation, and the
function z was derived from the relationship

$$z = \log_e \sigma_1 - \log_e \sigma_2.$$

It is often more convenient to perform the necessary calculations
in terms of the variance, that is, of $\sigma^2$, and to derive
z from the relationship

$$z = \frac{\log_e \sigma_1^2 - \log_e \sigma_2^2}{2}.$$

The procedures lead to the same result, of course, since
half the difference between the logarithms of the squared
standard deviations is equal to the difference between the
logarithms of the standard deviations, but the use of squared
measurements eliminates one step in the calculation.


Table 118

A Test of the Existence of Correlation

                          No. of                     Mean          Natural logarithm
Nature of variability     degrees of    Sum of       square        of mean square
                          freedom       squares      (variance)    loge σ²
                          (n)

Within arrays
  (Component A)           36             76.39        2.12         0.7514
Between arrays
  (Component B)            7            151.94       21.71         3.0778

                                                     Difference =  2.3264
                                                              z =  1.1632

When we divide the sums of the squares by the corresponding 
figures defining degrees of freedom, we have comparable 
measures of variance. Now it appears that the 
variance between arrays (21.71) is distinctly greater than 
the variance within arrays (2.12), in disproof of the hypothesis 
that the same forces account for the two variances. 
But we have a precise test to employ in determining 
whether these two variances are of the same degree of 
magnitude, within sampling limits. This is the coefficient 
z, which is half the difference between the natural logarithms 
of the two variances. In the present case, z is equal to 
(3.0778 - .7514)/2, or 1.1632. 


If the forces responsible for variation within arrays were 
the same as those responsible for variation between arrays 
(that is, if our hypothesis were true), the value of z would 
be zero, with a sample of infinite size. The value of z we 
have secured is not zero. This may be proof that our 
hypothesis is false, or it may merely be a result of sampling 
fluctuations. The value of z might be zero in a given infinite 
population, but a random sample would be expected to 
yield results deviating considerably from zero. We wish 
now to take account of sampling fluctuations, in determining 
whether the result we have secured is consistent with the 
hypothesis that the true value of z is zero. 

In determining the significance of the present results 
we enter Appendix Table VI with n1 (the number of degrees 
of freedom corresponding to the larger variance) equal to 
7 and n2 equal to 36. Interpolating in Table VI, we find that 
the 1 per cent value of z corresponding to the stated values 
of n1 and n2 is .5780.^1 A value as great as this or greater 
would occur only 1 time out of 100, as a result of sampling 
fluctuations, if the true value of z were zero. The 
actual value, 1.1632, far exceeds the 1 per cent value 
of z. The evidence strongly indicates that z deviates from 
zero not because of the play of chance, but because the 
forces responsible for variation between arrays are of a 
different order from those responsible for variations within 
arrays. We are justified in concluding that our results 
are not consistent with the assumption that the true value 
of z is zero. The hypothesis that the forces responsible 
for variation between arrays are of the same character 
as those responsible for variation within arrays is not 
tenable. The results indicate the presence of a real connection 
between alfalfa yield and depth of irrigation water 
applied. 

^1 It is necessary to interpolate on both scales of the z-table. First, following 
the procedure indicated on a preceding page, we interpolate in respect of n2. 
We obtain 

for n1 = 6, n2 = 36, the 1 per cent value of z = .6047; 1/6 = .1667 
for n1 = 8, n2 = 36, the 1 per cent value of z = .5580; 1/8 = .1250 
                                    difference = .0467;   d = .0417 

We must now interpolate on the n1 scale, since the degrees of freedom are 
n1 = 7, n2 = 36. For 1/n1 we have 1/7 = .1429. The difference between 1/7 
and 1/8, for which we must interpolate between the given values of z, is 
.1429 - .1250, or .0179. 

The required 1 per cent value of z = .5580 + \left( \frac{.0179}{.0417} \times .0467 \right) = .5780. 
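
The interpolation on the n1 scale described in the footnote, which proceeds by reciprocals of the degrees of freedom, may be sketched as follows. The helper name `interpolate_on_reciprocal` is our own; the tabled values .6047 (n1 = 6) and .5580 (n1 = 8) for n2 = 36 are those quoted in this chapter.

```python
def interpolate_on_reciprocal(n_target, n_low, z_low, n_high, z_high):
    """Interpolate z linearly in 1/n between two tabled degrees of freedom.

    Requires n_low < n_target < n_high; z_low and z_high are the tabled
    values of z at n_low and n_high respectively.
    """
    fraction = (1 / n_target - 1 / n_high) / (1 / n_low - 1 / n_high)
    return z_high + fraction * (z_low - z_high)


# 1 per cent values of z for n2 = 36, as quoted in the text.
z_one_percent = interpolate_on_reciprocal(7, 6, 0.6047, 8, 0.5580)
observed_z = 1.1632

print(z_one_percent)               # close to the .5780 given in the text
print(observed_z > z_one_percent)  # the observed z exceeds the 1 per cent value
```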

TESTING THE HYPOTHESIS OF A LINEAR RELATIONSHIP 

Since it appears that there is a relationship between 
these two variables, it is now in order to secure an acceptable 
function, defining the relationship in quantitative terms. 
We may do this by testing, in turn, various hypotheses 
concerning the form of this function, until we secure one 
with which the observations are not inconsistent. We shall 
start with the hypothesis that there is a linear relationship 
between alfalfa yield and depth of irrigation water applied.^1 

^1 Each hypothesis tested should be rational, acceptable on logical grounds. 
If we are thinking of general relationships, prevailing over the entire range of 
possible observation, the assumption of a straight-line relationship between 
alfalfa yield and amount of irrigation water applied is not tenable. For it is 
not to be expected that increased irrigation will increase yield without limit. 
In the present case we test the hypothesis of a linear relationship in order that 
the demonstration of procedure may be systematic and complete, although that 
hypothesis is not a rational one, even within the range of the present 
observations. 

The first step in applying the present test is to fit a 
straight line to the means of the eight arrays shown in 
Table 117. Variation among these means (component B 
of the total variation) reflects the presence of correlation 
between alfalfa yield and irrigation water applied. If the 
correlation is perfectly linear, all these class means will 
fall on the straight line; all the variation between arrays 
will be accounted for by the hypothesis of a linear 
relationship. If the relationship is substantially, though not 
perfectly, linear, the portion of component B not accounted 
for by linear regression will be insignificant. If the regression 
is not truly linear the residue of B not accounted for (i.e., 
the scatter of the means of the arrays about the straight 
line of regression) will be too great, and some other 
hypothesis concerning the character of the relationship between 
alfalfa yield and irrigation water applied must be employed. 

A straight line fitted by the method of least squares 
to the means of the eight arrays is shown in Fig. 82 on 
page 406. The equation to the line is Y = 5.038 + .0886X, 
where Y is alfalfa yield in tons per acre and X is depth of 
irrigation water applied, in inches. [We should note that 
in the fitting process the mean of each array is weighted 
by the number of observations in that array. This means, 
merely, that six points are assumed to have coordinates 
of 0, 3.88 (equal to those of the mean of the first array), 
that four points are assumed to have coordinates of 18, 6.80 
(equal to those of the mean of the third array), etc.] In 
Table 119 on page 510 are given the values of the means of 
the various arrays, and the corresponding computed values, 
as derived from the straight line of regression. 
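
The weighted fitting just described may be sketched as follows. This is an illustration under stated assumptions, not the original computation: it works from the class means as printed to two decimals in Table 119, so the constants agree with the equation in the text only to within rounding. The names `classes` and `weighted_line_fit` are our own.

```python
# Class means from Table 119: (inches of water, observations f, mean yield in tons).
classes = [(0, 6, 3.88), (12, 6, 5.63), (18, 4, 6.80), (24, 6, 7.92),
           (30, 6, 8.98), (36, 6, 9.27), (48, 6, 9.02), (60, 4, 8.42)]


def weighted_line_fit(rows):
    """Least-squares line Y = a + bX through (X, f, Y) rows, each weighted by f."""
    n = sum(f for _, f, _ in rows)
    mean_x = sum(f * x for x, f, _ in rows) / n
    mean_y = sum(f * y for _, f, y in rows) / n
    sxy = sum(f * (x - mean_x) * (y - mean_y) for x, f, y in rows)
    sxx = sum(f * (x - mean_x) ** 2 for x, f, _ in rows)
    b = sxy / sxx                # slope
    a = mean_y - b * mean_x      # intercept
    return a, b


a, b = weighted_line_fit(classes)
print(a, b)  # close to the 5.038 and .0886 of the text
```

Weighting each mean by f is equivalent to entering each class mean f times as a separate point, which is the interpretation given in the bracketed note.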

It is clear from the graph and the table that the fit of 
the straight line to the means of the arrays is not perfect. 
The inadequacy of the fit is measured by the sum of the 
squared deviations of the class means from the corresponding 
computed values (each squared deviation being weighted 
by the number of observations in the given class). This 
sum is equal to 44.79. 

This sum, to which we may refer as B2, is one component 
of B, the variation between arrays. It is that portion of 
the variation between arrays that is not accounted for 
by the hypothesis of a linear relation between yield and 
irrigation water. The method of deriving the other 
component of B is shown in Table 120. 

Table 119 

Alfalfa Yield and Depth of Irrigation Water 
(Class means and values based on linear relationship 
Y = 5.038 + .0886X) 

  (1)       (2)       (3)         (4)             (5)              (6)       (7)
Inches     No. of    Mean       Estimated       Difference
of water   obser-    yield of   yield, linear   between mean
(class)    vations   class      relationship    yield of class
                                (tons)          and estimated
                                                yield
             f        Yp          Yc            d = Yp - Yc        d^2       fd^2

   0         6       3.88        5.04             -1.16          1.3456     8.0736
  12         6       5.63        6.10             - .47           .2209     1.3254
  18         4       6.80        6.63             + .17           .0289      .1156
  24         6       7.92        7.16             + .76           .5776     3.4656
  30         6       8.98        7.70             +1.28          1.6384     9.8304
  36         6       9.27        8.23             +1.04          1.0816     6.4896
  48         6       9.02        9.29             - .27           .0729      .4374
  60         4       8.42       10.36             -1.94          3.7636    15.0544

                                                                 Total     44.7920
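
The residual sum in col. (7) of Table 119 may be checked directly from the tabulated figures. The sketch below is illustrative; the names `table_119` and `weighted_squared_deviations` are our own, and the inputs are the rounded values printed in the table.

```python
# Rows of Table 119: (observations f, class mean Yp, estimated yield Yc).
table_119 = [(6, 3.88, 5.04), (6, 5.63, 6.10), (4, 6.80, 6.63), (6, 7.92, 7.16),
             (6, 8.98, 7.70), (6, 9.27, 8.23), (6, 9.02, 9.29), (4, 8.42, 10.36)]


def weighted_squared_deviations(rows):
    """Sum of f * (Yp - Yc)^2 over all classes."""
    return sum(f * (yp - yc) ** 2 for f, yp, yc in rows)


b2 = weighted_squared_deviations(table_119)
print(b2)  # the 44.79 of the text, apart from floating-point noise
```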

The sum 107.15, to which we may refer as B1, is that 
component of the variation between arrays which is 
accounted for by the hypothesis of linear regression. The 
items in col. (3) of Table 120 differ from 7.48, the 
mean of all the observations, for the reason suggested 
by the hypothesis. They differ, on our present assumption, 
because with increased applications of water yield 
increases in a manner defined precisely by the equation 
Y = 5.038 + .0886X. The sum of these variations, 107.15, 
represents, on this assumption, the full effect on alfalfa 
yield of variations of irrigation applications. 

Table 120 

Computation of Variation in Alfalfa Yield Attributable to Irrigation 
Differences on the Hypothesis of Linear Regression 

  (1)       (2)       (3)             (4)           (5)               (6)       (7)
Inches     No. of    Estimated       Mean yield,   Difference
of water   obser-    yield, linear   all obser-    between mean
           vations   relationship    vations       yield and yield
                     (tons)                        estimated on
                                                   linear hypothesis
             f        Yc              Y-bar        d = Yc - Y-bar     d^2       fd^2

   0         6        5.04            7.48           -2.44           5.9536    35.7216
  12         6        6.10            7.48           -1.38           1.9044    11.4264
  18         4        6.63            7.48           - .85            .7225     2.8900
  24         6        7.16            7.48           - .32            .1024      .6144
  30         6        7.70            7.48           + .22            .0484      .2904
  36         6        8.23            7.48           + .75            .5625     3.3750
  48         6        9.29            7.48           +1.81           3.2761    19.6566
  60         4       10.36            7.48           +2.88           8.2944    33.1776

                                                                     Total    107.1520

The total of the two sums to which we have referred as 
B1 and B2 is equal to 151.94, the total variation between 
arrays. Working on the hypothesis that the variables 
with which we are dealing stand in a linear relationship, 
we have broken the component B of the total variation 
into two portions. One of these (B1) measures the variation 
between arrays that is accounted for by the linear hypothesis; 
the other (B2) measures the variation between arrays that 
is not accounted for by that hypothesis. We should expect 
some departure from linearity in a sample such as ours, 
even though it were drawn from a universe marked by a 
perfect linear relationship. But there are limits to the 
deviations that might reflect fluctuations of sampling. The 
question we now face is whether B2 is small enough to be 
accepted as the resultant of random factors, or whether 
it is so large as to represent a breakdown of our hypothesis. 

In our earlier discussion we noted that component A 
of the total variation measured the influence of a host of 
random forces affecting alfalfa yield, forces other than the 
irrigation factor. Component A, therefore, serves as an 
index of the magnitude of random forces, and hence as a 
standard defining the probable limits of sampling 
fluctuations, in so far as these are present in component B. We 
may use component A, which relates to variation within 
arrays, as a yardstick in determining whether B2 is 
attributable to fluctuations of sampling, or whether it is too large 
to be so explained. 
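
The division of B into its two portions may be checked from the tabulated figures. The sketch below is illustrative; the names are our own, the estimated yields and the general mean 7.48 are those of Table 120, and the residual 44.79 is the sum given in the text.

```python
# Rows of Table 120: (observations f, yield estimated from the straight line Yc).
table_120 = [(6, 5.04), (6, 6.10), (4, 6.63), (6, 7.16),
             (6, 7.70), (6, 8.23), (6, 9.29), (4, 10.36)]
GRAND_MEAN = 7.48  # mean of all observations


def explained_variation(rows, grand_mean):
    """Sum of f * (Yc - grand mean)^2: variation accounted for by the fitted line."""
    return sum(f * (yc - grand_mean) ** 2 for f, yc in rows)


b1 = explained_variation(table_120, GRAND_MEAN)
b2 = 44.79  # deviations from the straight line, from the text
print(b1, b1 + b2)  # B1 near 107.15; B1 + B2 near 151.94
```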

In comparing components A and B2 account must be 
taken of the number of degrees of freedom present in each. 
This has already been established for A. The following 
tabular summary of the operations just performed may help 
to explain the relations involved for B2. 

Nature of variability                        No. of degrees   Sum of    Mean
                                             of freedom       squares   square

Between arrays, due to linear regression
  (Component B1)                                   1          107.15
Deviations from straight line of regression
  (Component B2)                                   6           44.79     7.47
Total variation between arrays
  (Component B)                                    7          151.94



The seven degrees of freedom entering into component B 
are divided, one to component B1 and six to component B2. 
That the points on a straight line vary from one another 
with one degree of freedom is clear from a consideration 
of a linear equation y = a + bx. That the values of y 
may differ is due to the presence of the coefficient b, which 
defines the slope. If b were zero, the equation would define 
a horizontal line, with values of y constant. It is the slope 
that constitutes the one degree of freedom among points 
defined by a linear equation. With respect to B2, we are 
dealing with eight points, to which a straight line has 
been fitted. If there were but two points both of them 
would lie on the line; there would be no possibility of 
deviation. With three points, one degree of freedom to 
deviate is introduced; with eight points there are six degrees 
of freedom. The degrees of freedom to deviate from any 
fitted curve are obviously equal to the number of points 
to which the curve is fitted, less the number of constants 
in the equation to that curve. 
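
The rule just stated, and the mean square that follows from it, may be sketched briefly. The function names are our own; the figures 8, 2, and 44.79 are those of the text.

```python
def residual_degrees_of_freedom(points, constants):
    """Degrees of freedom to deviate from a fitted curve: points less constants."""
    if constants > points:
        raise ValueError("a curve cannot have more constants than fitted points")
    return points - constants


def mean_square(sum_of_squares, degrees_of_freedom):
    """Reduce a sum of squares to a variance by dividing by its degrees of freedom."""
    return sum_of_squares / degrees_of_freedom


line_df = residual_degrees_of_freedom(points=8, constants=2)  # straight line: a and b
b2_mean_square = mean_square(44.79, line_df)
print(line_df, f"{b2_mean_square:.3f}")
```

With two points and two constants the function returns zero, which is the case of no possible deviation mentioned above.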

Dividing 44.79 by 6 we may secure, then, the value of 
the variance (the mean square) comparable to the variance 
of component A. A test of our hypothesis again reduces to 
a comparison of variances. This appears in Table 121 . 

Table 121 

A Test of the Hypothesis of Linear Relationship 

Nature of variability                   Degrees of    Mean square   Natural logarithm
                                        freedom (n)   (variance)    of mean square

Within arrays (Component A)                 36           2.12            .7514
Deviation from straight line of
  regression (Component B2)                  6           7.47           2.0109

                                                      Difference     =  1.2595
                                                      z              =   .6298



The variation within arrays reflects the play of random 
factors, independent of irrigation. The force of these factors 
is indicated by a variance of 2.12. If similar random factors, 
independent of irrigation, were responsible for the deviations 
of the means of the eight arrays from the straight line of 
regression, we should expect the variance that measures 
such deviations to be of the same order of magnitude. 
Actually it is much greater, 7.47. But we cannot say, 
from inspection, that the difference between the two 
variances is not due to fluctuations of sampling. An accurate 
test is needed. We may compute the coefficient z, half 
the difference between the natural logarithms of the two 
variances, and apply such a test. 

From the values given we secure a value of z equal to 
.6298. In determining whether this value is significantly 
different from zero, use must be made again of Fisher's 
tables. For the values of n1 and n2 are relatively small 
and unequal, and the distribution of z under these conditions 
would not be sufficiently close to the normal type to justify 
the use of its standard deviation. Entering Appendix 
Table VI with n1 equal to 6, n2 to 36, we find that the 1 per 
cent value of z is .6047. We take this to mean that, if the true 
value of z were zero, random sampling fluctuations would 
be expected to give a value of z as great as .6047, or greater, 
only one time out of 100 trials. The actual value of z in 
the present instance is greater than .6047. Only rarely, 
less frequently than one time out of 100, would chance 
account for a value of z as great as the one observed. We 
conclude, therefore, that random forces, of the type 
responsible for variation within arrays, are not responsible for the 
deviations of the means of the eight arrays from the straight 
line of regression. These deviations are too great to be 
consistent with the hypothesis that there is a linear relationship 
between alfalfa yield and depth of irrigation water. This 
equation fails to account, adequately, for the observed 
variation between arrays. 
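
The test of linearity may be sketched as a comparison of the computed z with the tabled 1 per cent value. This is an illustration only; the function name `fisher_z` and the constant `ONE_PERCENT_Z` are our own, and the value .6047 is taken from the text, not computed.

```python
import math


def fisher_z(test_variance, standard_variance):
    """Half the difference between the natural logarithms of two variances."""
    return 0.5 * (math.log(test_variance) - math.log(standard_variance))


ONE_PERCENT_Z = 0.6047  # tabled value quoted in the text for n1 = 6, n2 = 36

z = fisher_z(7.47, 2.12)  # deviations from the line against variation within arrays
reject_linear_hypothesis = z > ONE_PERCENT_Z
print(z, reject_linear_hypothesis)
```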

TESTING THE HYPOTHESIS OF A CURVILINEAR RELATIONSHIP 

We may now test the hypothesis that a power curve 
of the second degree (Y = a + bX + cX^2) defines the 
relation between alfalfa yield and depth of irrigation water 
applied. The procedure is identical with that followed 
in the case of the straight line. By the method of least 
squares we determine the best values of the constants in 
an equation of the desired form. The curve is fitted to 
the means of the eight arrays, each weighted by the number 
of observations in that array. The derived equation is 
Y = 3.539 + .2527X - .002827X^2. The curve appears 
graphically in Fig. 82, and the computation of the sum 
of the squared deviations from it is shown in Table 122. 

The inadequacy of the fit is measured this time by the 
figure 4.61, the sum of the squared deviations from the 
power curve of the second degree. This sum, to which we 
may refer as B4, is a component of B, the variation between 
arrays. It is the portion that is not accounted for by the 
hypothesis of a curvilinear relationship, of the type assumed, 
between alfalfa yield and irrigation water applied. The 
other component of B is derived by the method indicated 
in Table 123 on page 516. 

Table 122 

Alfalfa Yield and Depth of Irrigation Water 
(Class means and values based on a power curve of the second degree) 

  (1)       (2)       (3)         (4)            (5)              (6)       (7)
Inches     No. of    Mean       Estimated      Difference
of water   obser-    yield of   yield, from    between mean
(class)    vations   class      equation       yield of class
                     (tons)     (tons)         and estimated
                                               yield
             f        yp          yc           d = yp - yc        d^2       fd^2

   0         6       3.88        3.54            + .34           .1156      .6936
  12         6       5.63        6.16            - .53           .2809     1.6854
  18         4       6.80        7.17            - .37           .1369      .5476
  24         6       7.92        7.98            - .06           .0036      .0216
  30         6       8.98        8.58            + .40           .1600      .9600
  36         6       9.27        8.97            + .30           .0900      .5400
  48         6       9.02        9.16            - .14           .0196      .1176
  60         4       8.42        8.52            - .10           .0100      .0400

                                                                 Total     4.6058
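
The fitting of the second degree curve may be sketched by forming and solving the weighted normal equations. This is an illustration under stated assumptions: it uses the class means as printed to two decimals, so the constants agree with those of the text only approximately, and the helper names are our own.

```python
# Class means: (inches of water, observations f, mean yield in tons).
classes = [(0, 6, 3.88), (12, 6, 5.63), (18, 4, 6.80), (24, 6, 7.92),
           (30, 6, 8.98), (36, 6, 9.27), (48, 6, 9.02), (60, 4, 8.42)]


def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def weighted_parabola_fit(rows):
    """Constants (a, b, c) of Y = a + bX + cX^2, each row weighted by f."""
    power_sums = [sum(f * x ** k for x, f, _ in rows) for k in range(5)]
    moment_sums = [sum(f * y * x ** k for x, f, y in rows) for k in range(3)]
    normal_matrix = [[power_sums[i + j] for j in range(3)] for i in range(3)]
    return solve_linear_system(normal_matrix, moment_sums)


a, b, c = weighted_parabola_fit(classes)
print(a, b, c)  # close to 3.539, .2527 and -.002827
```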

We may designate by B3 the sum 147.32. This is the 
component of the variation between arrays that is accounted 
for by the hypothesis of a relationship defined by a second 
degree curve. The items in col. (3) of Table 123 differ 
from the mean of all the observations, on our present 
assumption, because alfalfa yield varies with increased 
applications of water in a manner defined by the equation 

Y = 3.539 + .2527X - .002827X^2. 

Table 123 

Computation of Variation in Alfalfa Yield Attributable to Irrigation 
Differences on the Hypothesis of a Non-Linear Regression 

  (1)       (2)       (3)              (4)           (5)               (6)        (7)
Inches     No. of    Estimated        Mean yield,
of water   obser-    yield,           all obser-
           vations   equation of      vations
                     second degree
             f        yc               Y-bar        d = yc - Y-bar     d^2        fd^2

   0         6        3.54             7.48           -3.94          15.5236    93.1416
  12         6        6.16             7.48           -1.32           1.7424    10.4544
  18         4        7.17             7.48           - .31            .0961      .3844
  24         6        7.98             7.48           + .50            .2500     1.5000
  30         6        8.58             7.48           +1.10           1.2100     7.2600
  36         6        8.97             7.48           +1.49           2.2201    13.3206
  48         6        9.16             7.48           +1.68           2.8224    16.9344
  60         4        8.52             7.48           +1.04           1.0816     4.3264

                                                                      Total    147.3218

We have again broken B, the total variation between 
arrays, into two components, B3 representing the influence 
of the irrigation factor, working in accordance with a definite 
law, and B4 representing random factors, or random factors 
combined with the irrigation factor. (The irrigation factor 
enters into B4 to the extent that the hypothesis in question 
fails to take account of the true relation between alfalfa 
yield and depth of water applied.) This is, of course, a 
different division of B from that resulting from the 
application of a linear hypothesis. The present division may be 
set down in summary. 


Nature of variability                        No. of degrees   Sum of    Mean
                                             of freedom       squares   square

Between arrays, due to regression of
  second degree (Component B3)                     2          147.32
Deviations from second degree curve
  of regression (Component B4)                     5            4.61     .92
Total variation between arrays
  (Component B)                                    7          151.93


The seven degrees of freedom entering into component B 
are now divided, five to component B4 and two to 
component B3. The reasons for this allocation of the degrees of 
freedom are similar to those presented in discussing the 
linear hypothesis. As regards component B4, the item now of 
chief concern to us, it is clear that when a curve defined 
by an equation with three constants is fitted to eight 
points there are five degrees of freedom to deviate from 
that curve. 

Dividing 4.61 by 5 we secure .92, the value of the 
variance comparable to the variance of component A. For 
again we must use a criterion based on A, in determining 
the limits within which variation due to random factors, 
independent of irrigation, may play. We come again to a 
comparison of variances. 


Table 124 

A Test of the Hypothesis of Curvilinear Relationship 

Nature of variability                   Degrees of    Mean square   Natural logarithm
                                        freedom (n)   (variance)    of mean square

Within arrays (Component A)                 36           2.12            .7514
Deviation from second degree curve
  of regression (Component B4)               5            .92           -.0834

                                                      Difference     =  -.8348
                                                      z              =  -.4174



In this case the degree of deviation from the curve of 
regression defined by the power curve of the second degree 
is actually less than the deviation within arrays, which 
serves as our yardstick. The value of z is therefore negative, 
equal to -.4174. This measure may be tested for 
significance by the methods previously discussed. The z-table 
is entered with n1 = 36 (the number corresponding to the 
larger of the two variances), n2 = 5. Interpolating in the 
table for these values we obtain 1.1158 as the 1 per cent 
value of z. The present value is distinctly less than this. 
The difference between the two measures of variance is 
not significant. The departures from the curve of regression 
may be attributed to "chance," that is, to random factors 
independent of the irrigation factor. 
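
The handling of a negative z may be sketched as follows. The point illustrated is that n1 always belongs to the larger of the two variances, so the order of entry into the z-table is reversed when the variance under test falls below the standard. The function name is our own; the 1 per cent value 1.1158 is the figure quoted in the text, and the comparison of magnitudes is our reading of that passage.

```python
import math


def z_and_table_entry(test_variance, test_df, standard_variance, standard_df):
    """Return z and the (n1, n2) pair with which the z-table is entered.

    z is half the difference of natural logarithms (test less standard);
    n1 is the degrees of freedom of whichever variance is the larger.
    """
    z = 0.5 * (math.log(test_variance) - math.log(standard_variance))
    if test_variance >= standard_variance:
        n1, n2 = test_df, standard_df
    else:
        n1, n2 = standard_df, test_df
    return z, n1, n2


z, n1, n2 = z_and_table_entry(0.92, 5, 2.12, 36)
ONE_PERCENT_Z = 1.1158  # value quoted in the text for n1 = 36, n2 = 5
significant = abs(z) > ONE_PERCENT_Z
print(z, n1, n2, significant)
```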

In following this general procedure it is necessary to test 
different hypotheses (i.e., different functions) only until the 
difference between the variance defined by component A 
and the variance defining departures from the curve of 
regression be small enough to be attributed to the play 
of chance. If a P of .05 constitutes our standard, 
the difference between the two variances given in the 
preceding table, as measured by z, might be positive and as 
great as .4536, without leading to rejection of the hypothesis 
being tested. It could be as great as .6370 if our standard 
of significance were a P of .01.^1 A rather exceptionally 
close fit by the second degree curve we have employed gives 
the negative value of z actually obtained. 

We have arrived, then, at an hypothesis concerning the 
relation between alfalfa yield and depth of irrigation water 
applied, with which observed facts are not inconsistent. 
Our observations, be it noted, do not establish the truth 
of this hypothesis. Other hypotheses might be equally 
tenable, and perhaps even more closely in accord with the facts.^2 
All that we can say is that the observed facts do not 
disprove the hypothesis. If the hypothesis is tenable on rational 
grounds, we have reached a conclusion upon which we may 
rest, for the time. 

^1 ... being compared.) This larger of the two variances ... the method of 
interpolation is applicable over the range of ... 24 and ... In dealing with 
cases in this region, R. A. Fisher gives the following formulas for 
approximating the desired quantities: 

    5 per cent value of z = \frac{1.6449}{\sqrt{h - 1}} - .7843 \left( \frac{1}{n_1} - \frac{1}{n_2} \right) 

    1 per cent value of z = \frac{2.3263}{\sqrt{h - 1}} - 1.235 \left( \frac{1}{n_1} - \frac{1}{n_2} \right) 

In these formulas h is the harmonic mean of n1 and n2. That is, 

    \frac{2}{h} = \frac{1}{n_1} + \frac{1}{n_2}. 

^2 We could, of course, fit a curve of still higher degree, the equation to which 
contained four constants, or more, instead of the three constants in the 
equation actually employed. The deviations from this curve of higher degree would 
be smaller than from the curve of second degree, and z would be correspondingly 
smaller. It is a principle of scientific procedure, however, to employ the 
simplest acceptable function. Needless complexities, whether in the form of 
unnecessary assumptions or of unnecessary constants in an equation of 
relationship, are rigorously avoided. 

SUMMARY: VARIANCE ANALYSIS IN THE MEASURE OF RELATIONSHIP 

The procedure employed in the last example may be 
summarized and certain measurements presented which 
show the relation of this procedure to methods discussed in 
preceding chapters. The quantitative results are presented 
in Table 125. 

Table 125 

Component Elements of the Variability of Alfalfa Yield, and Various 
Measures of Correlation 

Total variability of observations        Sum of    Test of             Measure of
relating to alfalfa yield, and           squares   significance        correlation
components of this total

Total variability (sum of squared
deviations from mean)                    228.33

1. Division of total variability into:
   A. Variation unrelated to
      irrigation factor (i.e.,
      variation within arrays)            76.39    z = 1.1632          Correlation ratio
   B. Variation attributable to                    1 per cent value    η^2 = 151.94/228.33
      irrigation factor, and to                    of z = .5780            = .6654
      other causes, in indeterminate                                   η   = .82
      proportions (i.e., variation
      between arrays)                    151.94

2. Division of component B of (1)
   above into:
   B1. Variation attributable to
       irrigation factor on the
       assumption of a linear
       relationship                      107.15    z = .6298           Coefficient of correlation
   B2. Variation attributable to                   1 per cent value    r^2 = 107.15/228.33
       irrigation factor, but not                  of z = .6047            = .4693
       explainable in terms of a                                       r   = .69
       linear relationship, and to
       other causes, in
       indeterminate proportions          44.79

3. Division of component B of (1)
   above into:
   B3. Variation attributable to
       irrigation factor on the
       assumption of a relationship
       defined by power curve of
       second degree                     147.32    z = -.4174          Index of correlation
   B4. Variation attributable to                   1 per cent value    ρ^2 = 147.32/228.33
       irrigation factor, but not                  of z = 1.1158           = .6452
       explainable in terms of power                                   ρ   = .80
       curve of second degree, and to
       other causes, in
       indeterminate proportions           4.61


The meaning of this summary should be clear, with 
reference to the preceding demonstration. Component A 
of the total variability, being independent of the influence 
of the irrigation factor, is the yardstick, or standard of 
reference, which is used in all the tests of significance noted 
in the second column. Component B, in the first test, 
is shown to be clearly greater than A, when account is 
taken of the number of degrees of freedom present in the 
two quantities. Thereafter, component B is broken into 
sub-components, first on the hypothesis that alfalfa yield 
and irrigation are related by a linear function, next on 
the hypothesis that the relationship is defined by a power 
curve of the second degree. The evidence is not consistent 
with the first of these hypotheses, and it is rejected. (The 
hypothesis would be rejected on rational grounds, as well 
as on the basis of empirical evidence.) The results are not 
inconsistent with the second hypothesis, and we may accept 
it, subject to the possibility of modification on the basis 
of later experience. 

Three abstract measures of degree of correlation between 
alfalfa yield and applications of irrigation water are given 
in the right-hand column. All of these may be derived 
directly from the quantities employed in the variance 
analysis. Study of the elements of these correlation meas- 
ures, and of the relation of the several measures to the 
corresponding hypothesis, will provide a suggestive review 
of the general problem of correlation. 

We should note here that an assumption of normality 
is implied in the comparison of standard deviations, or 
of variances, in this type of analysis. Minor departures 
from normality do not materially affect the procedure, but 
substantial departures do so. The conversion to other 
forms (such as logarithms or reciprocals) of observations 
not normally distributed in natural terms will sometimes 
yield normal distributions. Where this is possible, the 
precision of the method of variance analysis is increased 
by such conversion. Limitations arising out of material 
departures from normality may be avoided, also, by the 
use of ranks, as is done in the computation of the coefficient 
of rank correlation. Appropriate procedures have been 
developed by Milton Friedman.^1 

^1 "The Use of Ranks to Avoid the Assumption of Normality Implicit in the 
Analysis of Variance," Journal of the American Statistical Association, Vol. 32, 
675-701. 

Variance Analysis in Testing the Significance of 
Seasonal Fluctuations 

The methods outlined in this chapter are applicable to 
certain of the problems encountered in the analysis of time 
series. They are peculiarly appropriate in determining 
whether the seasonal fluctuations observed in a given series 
represent a true seasonal pattern. Apparent seasonal 
movements would be present in any series of observations covering 
a period of years, by months. Chance factors would create 
some differences between averages of all the Januaries, all 
the Februaries, etc., even though no true seasonal movement 
existed. We require an objective test, to be used in 
determining whether the differences among such monthly 
averages ... 

The entries in the body of Table 126 are the figures 
... period 1918-1927, ... expressed as percentages of linear 
trend values. (The original data are given in Chapter VIII.) 
The arithmetic mean of the ten items for January appears 
in the bottom row, with similar means for the other months.^2 
The test for seasonality involves answering the question: 
Do these means differ significantly from 99.9867, the average 
...? We wish to define that portion of the total variance 
apparently due to seasonal ... then be appraised with 
reference to a ... call the residual variability of the series. 

^2 ... precisely the same as the seasonal indices given in Chapter ... 
representativeness of the monthly ... each month were used in the averaging 
process ... the arithmetic mean of all the ... for each month. 

In computing the total variance we may make use of the 
familiar relation 

    \frac{\sum d^2}{N} = \frac{\sum (d')^2}{N} - c^2, 

where d is the deviation 
of an observation from the true mean, d' is the deviation 
from an assumed mean, and c is the difference between true 
and assumed means. In this case we take the assumed 
mean at 0 on the original scale, and c is thus equal to the 
mean. Since we wish to work with sums of squared values, 
we use the relationship 

    \sum d^2 = \sum (d')^2 - N c^2. 

(The mean should be computed to more places than are to 
be finally retained, since the process of squaring and 
multiplying by N greatly magnifies even slight errors.) 

The entries in col. (16) of Table 126 are the sums of 
the squares of the items in the body of the table. Inserting 
the proper values in the above formula, we have 

    \sum d^2 = 1,213,250.14 - 120 \times 99.9867^2 
             = 1,213,250.14 - 1,199,680.82 
             = 13,569.32. 
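
The relationship between sums of squares about an assumed mean of zero and about the true mean may be sketched as follows. The four-item `sample` is hypothetical and serves only to show that the short method and the direct method agree; the final figures are those quoted in the text. Function names are our own.

```python
def sum_squared_deviations_direct(values):
    """Sum of squared deviations of the items from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def sum_squared_deviations_shortcut(sum_of_squares_about_zero, n, mean):
    """Sum of squares about zero, less N times the squared mean."""
    return sum_of_squares_about_zero - n * mean ** 2


# Hypothetical items, used only to compare the two methods.
sample = [90.0, 95.0, 102.0, 113.0]
direct = sum_squared_deviations_direct(sample)
shortcut = sum_squared_deviations_shortcut(
    sum(v ** 2 for v in sample), len(sample), sum(sample) / len(sample))

# Figures quoted in the text: 120 monthly items with mean 99.9867.
total_variability = sum_squared_deviations_shortcut(1_213_250.14, 120, 99.9867)
print(direct, shortcut, f"{total_variability:.2f}")
```

The parenthetical caution in the text applies here: a mean carried to too few places would be squared and multiplied by 120, enlarging the error.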

As in the alfalfa problem discussed above, this total 
may be broken into an element representing variance 
between the monthly means and variance within the several 
months. (Reference here is to the columns of Table 126.) 
The variance between months may be computed directly 
from the monthly means. 

Thus: 

Sum of squares of deviations of monthly means from grand mean

$$= 10(99.9867 - 90.60)^2 + 10(99.9867 - 92.04)^2 + 10(99.9867 - 96.38)^2 + \dots + 10(99.9867 - 88.35)^2.$$

That is, the deviation of each monthly mean from the 
grand mean is squared and weighted by the number of 
items represented by that mean; the sum of the twelve 
figures thus obtained is the required measure of variability 
between months. 
An alternative shorter method may be employed in
determining the variance between months, utilizing the
relationship

$$\Sigma d^2 = \Sigma (d')^2 - Nc^2.$$

Here each d' is the mean value for a given month. Each
squared value must be weighted by the number of items
represented by the mean. Thus

$$\Sigma (d')^2 = 10(90.60)^2 + 10(92.04)^2 + 10(96.38)^2 + \dots + 10(88.35)^2 = 1{,}207{,}068.40.$$

The correction factor, $Nc^2$, is the same as in the first
operation. We have, then,

Sum of squares of deviations of monthly means from
grand mean = 1,207,068.40 - 1,199,680.82

= 7,387.58.

This sum measures that portion of the total variability 
that may be attributed to seasonal fluctuations. Is it 
significant or does it merely reflect the play of the mass 
of undifferentiated factors we call chance? 

In answering a similar question concerning the alfalfa 
problem we used as a yardstick the variability independent 
of the one factor the effects of which were being studied — 
namely, irrigation differences. In the present case we could 
obtain a measure that is independent of seasonality by 
computing the variability within the several columns of 
Table 126. That is, each item in col. (2) could be deducted 
from the January mean, 90.60, and the sum of the squared 
deviations in this column obtained; a similar sum could
be obtained for each of the other columns numbered from 
(3) to (13). The grand sum of these figures would be the 
variability within arrays — variability clearly independent 
of seasonal forces since only differences among items for 
the same month enter into it. This sum, measuring varia- 
bility within columns, has a value of 6,181.74. The 
variability between columns plus the variability within 
columns is, of course, equal to the total. That is,
7,387.58 + 6,181.74 = 13,569.32.

The measure of variability within columns will not serve 
in the present instance, however. The yardstick should be 
a measure of the variability due to “chance”— to the 
play of a mass of random factors which may not be observed 
and measured individually. Effects that can be clearly 
attributed to specific causal forces should not be included 
in the yardstick. But some of the variability within months 
may be clearly assigned to changes associated with the 
classification by years. The average of the 12 monthly 
items for 1918 is 108.09; that for the 12 months in 1921 
is 87.63. The former was a year of prosperity, the latter 
one of depression. Clearly, some of the differences among 
the items in the January column, or in the May column, 
are definitely attributable to cyclical forces that raise all 
the monthly figures for one year and depress all the monthly 
figures for another year. (The influence of trend is not 
present, since the items in the body of the table are actual 
values expressed as percentages of trend.) The variability 
within months should be corrected by the subtraction of 
that portion of it that may be attributed to factors affecting 
yearly conditions as a whole. 

The influence of cyclical and other forces affecting whole 
years is measured by differences between the averages for 
1918, 1919, 1920, and the other years covered. These 
averages are given in col. (15) of Table 126. The desired 
quantity may be obtained by the precise methods used in 
measuring the variability between months. We have 

$$\Sigma d^2 = \Sigma (d')^2 - Nc^2$$

or,

Sum of squares of deviations of yearly averages from grand mean

$$= (12 \times 108.092^2 + 12 \times 98.542^2 + 12 \times 103.267^2 + \dots + 12 \times 98.383^2) - 120 \times 99.9867^2$$
$$= 1{,}203{,}537.88 - 1{,}199{,}680.82$$
$$= 3{,}857.06.$$

(There will, of course, be ten squared items within
parentheses, one for each of the ten years covered by the data.)
Subtracting 3,857.06 from 6,181.74, the measure of total 
variability within the columns, we have 2,324.68 as the 
balance. This is the desired yardstick. It measures that 
portion of the variability among the original items which 
is clearly independent of the seasonal influence. Secondly, 
it has been corrected by the subtraction of that portion 
of the variability within months which is attributable to 
cyclical and other factors responsible for broad changes 
from year to year. The final balance represents the play 
of forces independent both of seasonal movements and of 
broad swings affecting each yearly value as a whole. This 
residual variability, measured by the figure 2,324.68, reflects
the play of all those random, undifferentiated factors we
lump together as chance.*

This residual variability may be most readily computed by 
subtracting from the total variability the two figures measur- 
ing, respectively, variability between the means of the months 
and variability between the means of the years. At this stage 
of the computation these figures will be in the form of sums 
of squared deviations. The form of organization employed in 
Table 126 on page 528 is convenient for these calculations. 
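
The subtraction described above may be sketched as follows. This Python fragment is an illustration only; the variable names are assumptions, and the figures are the sums of squares quoted in the text.

```python
# Correction term N * c^2, common to all three sums of squares.
CORRECTION = 1_199_680.82

total_ss = 1_213_250.14 - CORRECTION        # all 120 items
between_months = 1_207_068.40 - CORRECTION  # 12 monthly means, each weighted by 10
between_years = 1_203_537.88 - CORRECTION   # 10 yearly means, each weighted by 12

# Variability within columns (months), and the residual left after the
# year-to-year component has also been removed.
within_months = total_ss - between_months
residual = total_ss - between_months - between_years

for label, value in [
    ("total", total_ss),
    ("between months", between_months),
    ("between years", between_years),
    ("within months", within_months),
    ("residual", residual),
]:
    print(f"{label:15s} {value:12.2f}")
```

The residual obtained in this way is the same figure reached in the text by deducting the between-years sum from the variability within columns.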

In the application of the test of significance, account 
must be taken of the number of degrees of freedom entering 
into each of these measures of variability. Table 127 indi- 
cates a suitable procedure. 

* When, as in the present example, the influences of the two variables, or
principles of classification, are independent, it is valid to use the residual
variability thus computed as a measure of the strength of random factors. If these
influences are not independent (if, in terms of the above example, seasonal
movements affecting the monthly averages and cyclical movements affecting
the annual averages should be correlated), the residual quantity will not
be an accurate measure of truly random factors. When the residual quantity
which is used as the yardstick in variance analysis is derived from
observations that are alike in respect of both principles of classification (i.e., when
the quantity measures variance within cells obtained by the application of a
two-fold principle of classification) this difficulty does not arise. An example
of this type is given in Appendix E.
Table 127

Analysis of Variance of Freight Car Loadings and Test of Seasonality

(1)                        (2)              (3)           (4)             (5)
Nature of variability      No. of degrees   Sum of        Mean square     Natural logarithm
                           of freedom (n)   squares       (variance)      of mean square
                                                          (3) ÷ (2)

Between means of years            9          3,857.06
Between means of months          11          7,387.58      671.598         6.50970
Residual variability             99          2,324.68       23.482         3.15627
Total                           119         13,569.32

Difference = 3.35343

$$z = \frac{3.35343}{2} = 1.67671$$



The item 3,857.06 measures the degree of difference 
between 10 yearly averages. Nine degrees of freedom are 
represented in this figure. (The use of weights in computing 
the sum of the squares does not affect the number of degrees 
of freedom.) Similarly, 11 degrees of freedom are
represented in the measure of variability between the 12 monthly
means. The total variability is computed from 120 items,
so there are 119 degrees of freedom in all. The number
of degrees of freedom in the residual variability is, therefore,
119 − (11 + 9), or 99.

The variance between the means of months (i.e., the 
mean square) is 671.598. The residual variance is 23.482. 
The test of seasonality reduces to the question: May the 
variance between the means of months be attributed to 
the random forces responsible for the residual variance? 
Unless the variance between the monthly means is signifi- 
cantly greater than the residual variance, no significance 
may be attached to the observed differences between the 
averages for January, February, March, and the other 
months. 
The test is applied with reference to the measure z,
which is equal to half the difference between the natural
logarithms of the two variances being compared. From
the entries in Table 127 we compute z as equal to 1.67671.
Referring to Appendix Table VI we find that for $n_1 = 11$
and $n_2 = 99$, the 1 per cent value of z is approximately .44.
The present value is distinctly greater than this. The
results are not consistent with the hypothesis that the
true value of z is zero. There is clear evidence of the existence
of a definite seasonal pattern in freight car loadings.¹
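
The computation of z for the test of seasonality can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of the original text: the function name `fisher_z` is hypothetical, and the 1 per cent point of .44 is simply the approximate value quoted above, not looked up independently.

```python
import math


def fisher_z(larger_variance, smaller_variance):
    """Half the difference between the natural logarithms of two variances."""
    return 0.5 * (math.log(larger_variance) - math.log(smaller_variance))


# Mean squares: sum of squares divided by degrees of freedom (Table 127).
ms_months = 7387.58 / 11
ms_residual = 2324.68 / 99

z_months = fisher_z(ms_months, ms_residual)

# z is half the natural logarithm of the ratio of the two variances,
# so the same test could be stated in terms of that ratio.
variance_ratio = ms_months / ms_residual

Z_ONE_PERCENT = 0.44  # approximate value quoted in the text for n1 = 11, n2 = 99
significant = z_months > Z_ONE_PERCENT

print(round(ms_months, 3), round(ms_residual, 3))
print(round(z_months, 4), round(variance_ratio, 2), significant)
```

Small differences in the last decimal places from the tabulated logarithms are to be expected, since the table works from rounded mean squares.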

The same yardstick may be applied in testing whether 
the differences between the yearly averages are significant.
The rather wide variations from year to year in the average 
values of the items in the body of Table 126 represent, 
presumably, the play of cyclical forces plus major “acci- 
dental fluctuations” affecting yearly totals. (The trend 
factor, had it not been eliminated, would have combined 
with these other two to create differences among the yearly 
totals.) But are these year-to-year differences great enough 
to be attributed to definite forces other than the chance 
factors represented in the residual variance we are using 
as yardstick? 

The variance between means of years is equal to
3,857.06 ÷ 9, or 428.562. Is this significantly greater than
23.482, the residual variance? Following the procedure
illustrated in Table 127 we obtain 1.36352 as the value
of z. The 1 per cent value of z, for $n_1 = 9$, $n_2 = 99$, is
approximately .47. The test indicates that the differences
between the annual averages are due to definite forces
other than the random factors represented in the residual
variance.

¹ In the test here applied we are proceeding on the assumption that the
seasonal pattern is constant from year to year. If it is not constant, the
accuracy of the residual variability, as a measure of "chance" factors, and of
the measure of variability between months will be affected, and the
significance of the results will be lessened. If there is reason to believe that seasonal
movements have changed over the period covered, tests of the kind suggested
in Chapter VIII should precede the tests here discussed.
CHAPTER XVI


THE MEASUREMENT OF RELATIONSHIP: 
MULTIPLE AND PARTIAL CORRELATION 

In dealing with methods of defining correlation in the 
preceding chapters we have been concerned with problems 
involving only two variables, a dependent variable and a 
single independent variable. We have found, in certain 
cases, a fairly high degree of correlation between the two 
variables studied. But it is obvious that, in general, 
economic phenomena are affected by more than one factor, 
that the fluctuations in a single variable may be due to 
the interaction of many forces. In dealing with just two 
variables all other factors are ignored, on the assumption, 
usually, that in the single independent variable are found 
the most important causes¹ of fluctuations in the dependent
variable. Thus, in the alfalfa example given, the effect 
upon yield of but a single factor, irrigation, was studied. 
Yet variations in rainfall and temperature must have 
affected the yield in the different years studied. Similarly, 
variations in practically every factor dealt with in economic 
analysis are traceable to more than one cause. If our 
analysis is to be complete we must employ methods which 
will enable more than two variables to be handled at a 
time. We need instruments that will assist us in measuring 
the combined effect upon a single variable of a number
of factors. Such instruments may be secured by a simple 
extension of methods already familiar. 

In Table 128 are presented figures showing the yield 
of corn, per acre, in Kansas from 1890 to 1933, together 

¹ This should not be taken to mean that the coefficient of correlation
measures or establishes causal relationships.
Table 128

Corn Yield and Temperature in Kansas, 1890-1933

(1)      (2)              (3)             (4)             (5)
Year     Average yield    Average June    Average July    Average August
         per acre,        temperature     temperature     temperature
         in bushels
         $X_1$            $X_2$           $X_3$           $X_4$

1890     15.6             77.6            83.1            76.1
1891     26.7             70.7            74.0            75.1
1892     24.5             73.4            77.5            76.5
1893     21.3             74.7            79.5            73.8
1894     11.2             74.2            77.8            78.0
1895     24.3             71.7            74.9            76.0
1896     28.0             74.1            78.1            78.7
1897     18.0             76.6            80.2            76.0
1898     16.0             75.0            77.7            78.2
1899     27.0             73.9            76.2            80.6
1900     19.0             74.9            77.9            81.0
1901      7.8             77.3            85.0            79.1
1902     29.9             70.9            76.8            78.2
1903     25.6             67.2            78.3            76.3
1904     20.9             70.4            75.6            74.6
1905     27.7             75.5            74.5            78.7
1906     28.9             71.8            73.8            76.3
1907     22.1             72.0            78.4            78.1
1908     22.0             72.1            75.8            76.2
1909     19.9             73.1            78.1            80.1
1910     19.0             72.2            79.5            76.7
1911     14.5             80.5            78.6            76.4
1912     23.0             69.3            79.9            77.4
1913      3.2             74.2            82.1            84.2
1914     18.5             78.2            79.9            78.2
1915     31.0             69.2            74.0            70.1
1916     10.0             70.3            81.2            79.6
1917     13.0             72.8            80.8            73.4
1918      7.1             78.4            78.3            82.3
1919     15.2             72.3            80.2            78.3
1920     26.5             72.8            77.6            72.9
1921     22.2             74.4            79.2            78.6
1922     19.3             75.2            77.0            80.1
1923     21.7             73.3            79.4            78.3
1924     21.7             74.3            75.1            79.0
1925     16.6             77.7            79.7            77.4
1926     11.0             72.5            78.4            79.1
1927     30.0             70.9            76.9            73.1
1928     27.0             67.7            78.1            77.1
1929     17.5             72.2            78.8            78.9
1930     12.0             73.1            81.7            80.3
1931     17.5             78.1            80.6            76.1
1932     18.5             74.3            81.8            79.2
1933     11.5             80.5            81.4            76.8

with the average June, July, and August temperatures for
each of these years.

The Relation between Corn Yield and Temperature:
Preliminary Analysis

It is known that corn yield is affected by the temperature
during the growing season. The object of the present
study is the determination of the precise relation between 
yield and temperature during each of the three months 
given, in order to secure a basis for estimating the yield 
from a knowledge of the temperature. As certain growing 
months are more important than others, the relation 
between temperature and yield may be determined, first, 
for each of the three months separately. 

The equation which describes the relationship between
yield per acre and June temperature will be of the type

$$X_1 = a + b_{12}X_2.$$

The equation describing the relationship between yield per
acre and July temperature will be of the type

$$X_1 = a + b_{13}X_3.$$

(In each case $X_1$ represents average corn yield per acre,
for the State, while $X_2$, $X_3$, etc., represent the absolute
temperature, in degrees Fahrenheit.) Instead of using the
symbols X and Y to represent the variables, as in the
preceding examples, $X_1$, $X_2$, $X_3$, etc., are employed, $X_1$
representing in this case the dependent variable. The
symbol for the constant representing the slope (the coefficient
of regression) is, in the first instance above, $b_{12}$. The
subscripts 1 and 2 indicate the variables to which this
constant refers, the first subscript always representing the
dependent variable ($X_1$ in the example cited), the second
the independent variable ($X_2$ in the illustration above).
These subscripts are necessary to distinguish the different
constants when several variables enter into the problem.
The meaning is precisely the same as in the former examples
when no subscripts were needed because only two variables
were dealt with.

Solving the proper normal equations for the constants
in the equation which describes the average relationship
between yield per acre and June temperature, we have

$$X_1 = 100.35 - 1.096X_2.$$

The value of $S_{12}$ may be determined from the formula

$$S_{12}^2 = \frac{\Sigma(X_1^2) - a\Sigma(X_1) - b_{12}\Sigma(X_1X_2)}{N}.$$

(The subscripts to S, and those to r which appear below,
have the same meaning as those employed in the preceding
paragraph.) Substituting the given values, and solving,
we have

$$S_{12}^2 = 33.593$$

and

$$S_{12} = 5.80.$$

The significance of the standard error, S, as a measure
of the reliability of estimates based upon the equation of
relationship, has been fully explained. In judging of the
usefulness of the equation, $S_{12}$ should be compared with $\sigma_1$
(the standard deviation of $X_1$) which may be looked upon
as a measure of the reliability of estimates based upon the
arithmetic mean of the variable $X_1$. For this we have

$$\sigma_1 = 6.68.$$

Clearly, the estimates from the equation are more reliable
than those based upon the mean. The coefficient of
correlation, r, expresses this relationship in abstract terms. We
may get this value from the equation

$$r_{12}^2 = \frac{a\Sigma(X_1) + b_{12}\Sigma(X_1X_2) - Nc_1^2}{\Sigma(X_1^2) - Nc_1^2}.$$

Solving for r, and giving it the sign of $b_{12}$, we have

$$r_{12} = -.4984.$$


These values indicate a negative correlation, though not 
a high one, between yield per acre of corn and June tem- 
perature in Kansas. Let us see if the estimates could be 
improved if based upon the temperature in July instead 
of in June. 

The values needed in this study may be computed from
Table 128. Solving for the constants in the equation of
regression, we secure the equation

$$X_1 = 166.07 - 1.866X_3.$$

For the standard error, we have

$$S_{13} = 4.81$$

and for the coefficient of correlation

$$r_{13} = -.6948.$$

We have here a closer relation and a better basis for
estimates than in the case when June temperature was
considered.

Repeating the process for yield per acre and August
temperature, we have

$$X_1 = 119.45 - 1.288X_4$$
$$S_{14} = 5.78$$
$$r_{14} = -.5013.$$

August temperature, it is evident, also affects the corn
yield in Kansas, a low temperature conducing to yield
above normal. The relationship is not so close as in the
case of July temperature, but it is still significant. What
is needed now is some method of combining these three 
factors, in order that an estimate may be based upon a 
knowledge of their influence, in combination, upon the 
yield of corn. The addition or averaging of the temperatures 
in the three months will not do, for July is obviously more 
important than either of the other months. The principle 
of the method by which this may be accomplished is simple. 
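
The three single-variable results of the preliminary analysis can be reproduced from second-order moments alone. The sketch below is an illustration in Python, not the book's own procedure: it uses the variances and mean products that the chapter tabulates further on, and it relies on the standard least-squares identities (slope equal to the mean product divided by the variance of the independent variable, and $S^2 = \sigma_1^2(1 - r^2)$) in place of the summation formulas given above. The function and variable names are assumptions.

```python
import math

VAR_YIELD = 44.6878  # variance of corn yield

# For each month: (variance of temperature, mean product with yield).
MONTHS = {
    "June": (9.2385, -10.1263),
    "July": (6.2013, -11.5671),
    "August": (6.7723, -8.7213),
}


def simple_fit(var_x, mean_product, var_y=VAR_YIELD):
    """Slope, correlation and standard error for a one-variable regression."""
    slope = mean_product / var_x
    r = mean_product / math.sqrt(var_x * var_y)
    std_error = math.sqrt(var_y * (1.0 - r * r))
    return slope, r, std_error


fits = {month: simple_fit(var_x, p) for month, (var_x, p) in MONTHS.items()}

# The month whose temperature gives the smallest standard error of estimate.
best_single_month = min(fits, key=lambda month: fits[month][2])

for month, (slope, r, std_error) in fits.items():
    print(f"{month:7s} slope={slope:7.3f}  r={r:7.4f}  S={std_error:5.2f}")
print(best_single_month)
```

Agreement with the figures in the text is close but not exact in the last decimal place, because the moments used here are themselves rounded.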
The Estimation of Corn Yield from Three
Independent Variables

The estimating or regression equation in the present case
will be one in which there is a single dependent variable
(corn yield) and three independent variables. It will be
of the form

$$X_1 = a + b_{12.34}X_2 + b_{13.24}X_3 + b_{14.23}X_4.$$

If we can determine the values of the four constants, we
may substitute given values of $X_2$, $X_3$, and $X_4$ in the
equation and thus get an estimate for $X_1$ in precisely the same
way as when two variables are dealt with. The method
of least squares affords the means of solving for the required
constants.

The symbols require a word of explanation, as a perfectly
simple equation is given a rather ponderous appearance
by all the subscripts employed. The symbol $b_{12}$, it has been
explained, represents the coefficient of regression of $X_1$ on $X_2$
(i.e., the slope of the line describing their relationship, $X_1$
being dependent) when these two variables alone are
included in the study. The symbol $b_{12.34}$ represents the
coefficient of net regression of $X_1$ on $X_2$. The addition of the
subscripts 3 and 4 to the right of the period means, simply,
that the variables $X_3$ and $X_4$ have been included in the
study and the effects of their variations eliminated, in so
far as this one constant ($b_{12.34}$) is concerned. This constant
measures the weight which must be given to the variable
$X_2$ in an estimate of $X_1$ based upon the three independent
variables, $X_2$, $X_3$, and $X_4$. It will not, of course, be the
same as $b_{12}$, which indicates the weight given to $X_2$ when
an estimate of $X_1$ is based upon $X_2$ alone. Similarly the
constant $b_{13.24}$, the coefficient of net regression of $X_1$ on $X_3$,
measures the weight given to $X_3$ when $X_2$ and $X_4$ are also
included. Each coefficient represents a single, simple
constant, but the subscripts are necessary in order that the
precise meaning of this constant may be clear. The subscripts
to the left of the period are termed primary subscripts,
those to the right secondary subscripts.

FORMATION AND SOLUTION OF THE NORMAL EQUATIONS 

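The first task¹ is the securing of the normal equations
required in solving for the constants in the estimating
equation given above. Following the usual procedure² we
have:

$$\text{I} \quad \Sigma(X_1) = Na + b_{12.34}\Sigma(X_2) + b_{13.24}\Sigma(X_3) + b_{14.23}\Sigma(X_4)$$

$$\text{II} \quad \Sigma(X_1X_2) = a\Sigma(X_2) + b_{12.34}\Sigma(X_2^2) + b_{13.24}\Sigma(X_2X_3) + b_{14.23}\Sigma(X_2X_4)$$

$$\text{III} \quad \Sigma(X_1X_3) = a\Sigma(X_3) + b_{12.34}\Sigma(X_2X_3) + b_{13.24}\Sigma(X_3^2) + b_{14.23}\Sigma(X_3X_4)$$

$$\text{IV} \quad \Sigma(X_1X_4) = a\Sigma(X_4) + b_{12.34}\Sigma(X_2X_4) + b_{13.24}\Sigma(X_3X_4) + b_{14.23}\Sigma(X_4^2).$$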

The given values might be substituted in these simul- 
taneous equations and solutions secured directly for the 
four constants. It is possible to reduce the number of 
normal equations by one, however, and thus lessen mate- 
rially the labor of computation. This is done by using 
deviations from the arithmetic mean for each variable 
instead of absolute values, getting rid in this way of the 
constant term a in the original equation. 

¹ The approach to the problem of multiple correlation which is here taken
follows that of H. R. Tolley and M. J. B. Ezekiel, "A Method of Handling
Multiple Correlation Problems," Journal of the American Statistical
Association, December, 1923, 993-1003.

² Cf. Appendix A for a discussion of this procedure and of the methods
employed in simplifying the normal equations.

If we let $A_1$, $A_2$, $A_3$, etc., represent the arithmetic means
of the different variables while $x_1$, $x_2$, $x_3$, etc., represent
deviations from the means, we may replace the absolute
numbers $X_1$, $X_2$, $X_3$, etc., by their equivalents, $x_1 + A_1$,
$x_2 + A_2$, $x_3 + A_3$, etc. Making these substitutions in the
normal equations, certain algebraic simplifications are
possible which eliminate the first of the normal equations,
and reduce the others to the following form:

$$\frac{\Sigma(x_1x_2)}{N} = \frac{\Sigma(x_2^2)}{N}\,b_{12.34} + \frac{\Sigma(x_2x_3)}{N}\,b_{13.24} + \frac{\Sigma(x_2x_4)}{N}\,b_{14.23}$$

$$\frac{\Sigma(x_1x_3)}{N} = \frac{\Sigma(x_2x_3)}{N}\,b_{12.34} + \frac{\Sigma(x_3^2)}{N}\,b_{13.24} + \frac{\Sigma(x_3x_4)}{N}\,b_{14.23}$$

$$\frac{\Sigma(x_1x_4)}{N} = \frac{\Sigma(x_2x_4)}{N}\,b_{12.34} + \frac{\Sigma(x_3x_4)}{N}\,b_{13.24} + \frac{\Sigma(x_4^2)}{N}\,b_{14.23}.$$

All the variables in the above equations refer to deviations
from the respective arithmetic means. Therefore $\Sigma(x_1x_2)/N$
is simply the mean product of the variables $x_1$ and $x_2$,
$\Sigma(x_2^2)/N$ is $\sigma_2^2$, etc. Representing the various mean products
by the symbols $p_{12}$, $p_{13}$, etc., and inserting the symbols
for the squares of the standard deviations, we secure, for
the normal equations:

$$p_{12} = \sigma_2^2\,b_{12.34} + p_{23}\,b_{13.24} + p_{24}\,b_{14.23}$$
$$p_{13} = p_{23}\,b_{12.34} + \sigma_3^2\,b_{13.24} + p_{34}\,b_{14.23}$$
$$p_{14} = p_{24}\,b_{12.34} + p_{34}\,b_{13.24} + \sigma_4^2\,b_{14.23}.$$

This is the most convenient form for the solution of the
normal equations.

From the data, as arranged in Table 128, the following
values are derived:

$\Sigma(X_1)$ = 863.9          $\Sigma(X_1^2)$ = 18,928.17
$\Sigma(X_2)$ = 3,241.5        $\Sigma(X_2^2)$ = 239,209.57
$\Sigma(X_3)$ = 3,453.4        $\Sigma(X_3^2)$ = 271,317.92
$\Sigma(X_4)$ = 3,409.1        $\Sigma(X_4^2)$ = 264,433.19

$\Sigma(X_1X_2)$ = 63,198.42
$\Sigma(X_1X_3)$ = 67,295.48
$\Sigma(X_1X_4)$ = 66,550.84
$\Sigma(X_2X_3)$ = 254,544.98
$\Sigma(X_2X_4)$ = 251,246.54
$\Sigma(X_3X_4)$ = 267,649.61

$c_1 = \Sigma(X_1)/N$ = 19.6341      $c_1^2$ = 385.4979
$c_2$ = 73.6705                      $c_2^2$ = 5,427.3426
$c_3$ = 78.4864                      $c_3^2$ = 6,160.1150
$c_4$ = 77.4795                      $c_4^2$ = 6,003.0729

From these values, the quantities necessary for the solution
of the normal equations may be readily determined.
These quantities are brought together below:


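$$\sigma_1^2 = \frac{\Sigma(X_1^2)}{N} - c_1^2 = \frac{18{,}928.17}{44} - 385.4979 = 44.6878$$

$$\sigma_2^2 = \frac{239{,}209.57}{44} - 5{,}427.3426 = 9.2385$$

$$\sigma_3^2 = \frac{271{,}317.92}{44} - 6{,}160.1150 = 6.2013$$

$$\sigma_4^2 = \frac{264{,}433.19}{44} - 6{,}003.0729 = 6.7723$$

$$p_{12} = \frac{\Sigma(X_1X_2)}{N} - c_1c_2 = \frac{63{,}198.42}{44} - 1{,}446.4540 = -10.1263$$

$$p_{13} = \frac{67{,}295.48}{44} - 1{,}541.0098 = -11.5671$$

$$p_{14} = \frac{66{,}550.84}{44} - 1{,}521.2403 = -8.7213$$

$$p_{23} = \frac{254{,}544.98}{44} - 5{,}782.1323 = 2.9808$$

$$p_{24} = \frac{251{,}246.54}{44} - 5{,}707.9535 = 2.1951$$

$$p_{34} = \frac{267{,}649.61}{44} - 6{,}081.0870 = 1.8586.$$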
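
The reduction from raw sums to variances and mean products can be sketched as below. The Python is illustrative only and its names are assumptions; the sums are those listed above, with index 1 standing for yield and 2, 3, 4 for June, July and August temperature. The means are rounded to four decimal places before squaring, as the text appears to do, so that the printed figures are reproduced; working with unrounded means would shift the results slightly.

```python
N = 44

SUMS = {1: 863.9, 2: 3241.5, 3: 3453.4, 4: 3409.1}
SUMS_OF_SQUARES = {1: 18928.17, 2: 239209.57, 3: 271317.92, 4: 264433.19}
SUMS_OF_PRODUCTS = {
    (1, 2): 63198.42, (1, 3): 67295.48, (1, 4): 66550.84,
    (2, 3): 254544.98, (2, 4): 251246.54, (3, 4): 267649.61,
}

# Means, rounded to four places to follow the figures given in the text.
means = {i: round(total / N, 4) for i, total in SUMS.items()}

# Variance: mean square about zero less the squared mean.
variances = {i: SUMS_OF_SQUARES[i] / N - means[i] ** 2 for i in SUMS}

# Mean product of deviations: mean raw product less the product of means.
mean_products = {
    (i, j): total / N - means[i] * means[j]
    for (i, j), total in SUMS_OF_PRODUCTS.items()
}

for i in sorted(variances):
    print(f"variance {i}: {variances[i]:.4f}")
for pair in sorted(mean_products):
    print(f"mean product {pair}: {mean_products[pair]:.4f}")
```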
Substituting in the normal equations, we have:

$$-10.1263 = 9.2385\,b_{12.34} + 2.9808\,b_{13.24} + 2.1951\,b_{14.23}$$

$$-11.5671 = 2.9808\,b_{12.34} + 6.2013\,b_{13.24} + 1.8586\,b_{14.23}$$

$$-8.7213 = 2.1951\,b_{12.34} + 1.8586\,b_{13.24} + 6.7723\,b_{14.23}.$$

Solving these simultaneous equations¹ we secure the
following values for the constants:

$$b_{12.34} = -0.460 \qquad b_{13.24} = -1.420 \qquad b_{14.23} = -0.749.$$

The required equation is, therefore,

$$x_1 = -0.460x_2 - 1.420x_3 - 0.749x_4.$$

This is the equation of regression of $x_1$ on $x_2$, $x_3$, and $x_4$.
Any given values of the three independent variables (June 
temperature, July temperature, and August temperature) 
may be substituted in this equation, and the most prob- 
able value of the dependent variable (corn yield per acre) 
determined. In the equation as it stands, it should be noted, 
all the variables are expressed as deviations from their re- 
spective arithmetic means. For practical purposes it is ad- 
visable to have an equation in terms of the original values. 
In other words, it is desirable to shift the origin from the 
point of averages to the zero point on the original scales. 
This necessitates re-introducing the constant term a. 
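
The solution of the three normal equations can be verified with a short program. The sketch below is an illustration in Python and uses ordinary Gaussian elimination with partial pivoting rather than the Doolittle method mentioned in the footnote; the function name `solve_linear` and the variable names are assumptions. The coefficients are those of the normal equations as substituted above.

```python
def solve_linear(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination
    with partial pivoting, followed by back-substitution."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Bring the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


# Variances and mean products of the three temperature series.
MOMENTS = [
    [9.2385, 2.9808, 2.1951],
    [2.9808, 6.2013, 1.8586],
    [2.1951, 1.8586, 6.7723],
]
# Mean products of each temperature series with yield.
MEAN_PRODUCTS_WITH_YIELD = [-10.1263, -11.5671, -8.7213]

b_june, b_july, b_august = solve_linear(MOMENTS, MEAN_PRODUCTS_WITH_YIELD)
print(round(b_june, 3), round(b_july, 3), round(b_august, 3))
```

Any other method of solving simultaneous linear equations should give the same coefficients to within rounding.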

The value of a may be determined from the equation

$$A_1 = a + A_2b_{12.34} + A_3b_{13.24} + A_4b_{14.23}$$

where the A's represent the respective arithmetic means.²
Inserting the proper values, we have

$$19.6341 = a + 73.6705(-0.46005) + 78.4864(-1.41967) + 77.4795(-0.74910).^1$$

Solving,

$$a = 222.99.$$

The equation of regression in terms of original values is,
therefore,

$$X_1 = 222.99 - 0.460X_2 - 1.420X_3 - 0.749X_4.$$

¹ Any method of solution may be employed. Perhaps the most convenient
with three or more equations is the Doolittle method. This is explained in
detail in Appendix A.

² This equation is derived from the first normal equation, as given on p. 537,

$$\Sigma(X_1) = Na + b_{12.34}\Sigma(X_2) + b_{13.24}\Sigma(X_3) + b_{14.23}\Sigma(X_4).$$

Replacing the absolute numbers $X_1$, $X_2$, etc., by their equivalents $x_1 + A_1$,
$x_2 + A_2$, etc., we secure

$$\Sigma(x_1) + NA_1 = Na + b_{12.34}[\Sigma(x_2) + NA_2] + b_{13.24}[\Sigma(x_3) + NA_3] + b_{14.23}[\Sigma(x_4) + NA_4].$$

Since $\Sigma(x_1) = 0$, $\Sigma(x_2) = 0$, etc., these values disappear. Dividing through by
N we obtain the equation presented above.
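
The shift of origin, and the use of the resulting equation for estimating, may be sketched as follows. This Python fragment is illustrative; the names `intercept` and `estimate_yield` are assumptions, and the temperatures two degrees above the means are a hypothetical input chosen only to show the effect of the coefficients.

```python
# Arithmetic means and net regression coefficients quoted in the text.
MEANS = {"yield": 19.6341, "june": 73.6705, "july": 78.4864, "august": 77.4795}
COEFFS = {"june": -0.46005, "july": -1.41967, "august": -0.74910}

# The regression passes through the point of averages, which fixes a.
intercept = MEANS["yield"] - sum(COEFFS[m] * MEANS[m] for m in COEFFS)


def estimate_yield(june, july, august):
    """Estimated corn yield (bushels per acre) from three monthly mean temperatures."""
    temps = {"june": june, "july": july, "august": august}
    return intercept + sum(COEFFS[m] * temps[m] for m in COEFFS)


# At the mean temperatures the estimate equals the mean yield.
at_means = estimate_yield(MEANS["june"], MEANS["july"], MEANS["august"])

# Hypothetical summer in which every month is two degrees above its mean.
two_degrees_warmer = estimate_yield(
    MEANS["june"] + 2, MEANS["july"] + 2, MEANS["august"] + 2
)

print(round(intercept, 2))
print(round(at_means, 4), round(two_degrees_warmer, 2))
```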

COMPUTATION OF THE STANDARD ERROR OF ESTIMATE 

Are estimates based upon this equation any more reliable
than those based upon the equations previously derived,
each of which referred to a single independent variable?
To answer this question the value of the standard error
must be computed. This will be represented in the present
case by $S_{1.234}$, the subscripts referring to the single dependent
variable ($X_1$) and the three independent variables. This
value may be computed from the formula²

¹ The arbitrary origin is at zero on each of the original scales, hence $A_1 = c_1$,
$A_2 = c_2$, etc. To ensure greater accuracy in solving for a, the values of the
coefficients $b_{12.34}$, $b_{13.24}$, etc., are given to a greater number of decimal places than
in the equation of regression.

² This formula may be derived as follows: Given an equation of the type

$$x_1 = b_{12.34}x_2 + b_{13.24}x_3 + b_{14.23}x_4$$

(in which the variables refer to deviations from the means) each residual may
be computed from the equation

$$d = b_{12.34}x_2 + b_{13.24}x_3 + b_{14.23}x_4 - x_1. \qquad (1)$$

Multiplying throughout by d, and adding, we have

$$\Sigma(d^2) = b_{12.34}\Sigma(dx_2) + b_{13.24}\Sigma(dx_3) + b_{14.23}\Sigma(dx_4) - \Sigma(dx_1);$$

but it follows from the method of fitting that

$$\Sigma(dx_2) = 0, \qquad \Sigma(dx_3) = 0, \qquad \Sigma(dx_4) = 0$$

and, therefore,

$$\Sigma(d^2) = -\Sigma(dx_1). \qquad (2)$$

Multiplying each residual equation (1) by $x_1$ and adding, we have

$$\Sigma(dx_1) = b_{12.34}\Sigma(x_1x_2) + b_{13.24}\Sigma(x_1x_3) + b_{14.23}\Sigma(x_1x_4) - \Sigma(x_1^2).$$

Substituting the equivalent of $\Sigma(dx_1)$ in equation (2) we secure

(Footnote continued below.)
$$S_{1.234}^2 = \sigma_1^2 - b_{12.34}\,p_{12} - b_{13.24}\,p_{13} - b_{14.23}\,p_{14}.$$

Substituting the proper values, we have

$$S_{1.234}^2 = 44.6878 - 4.6586 - 16.4215 - 6.5331 = 17.0746$$
$$S_{1.234} = 4.13.^1$$

This is to be interpreted just as the standard error of estimate
was interpreted in previous cases. The reliability of estimates
based upon the mean value of $X_1$ is measured by $\sigma_1$, which
has a value of 6.68. The reliability of estimates based
upon the equation of net regression, when yield is considered
as a function of temperature in June, July, and August, is
measured by $S_{1.234}$, which has a value of 4.13. It is clear
that estimates made from the equation are distinctly more
reliable than those based upon a knowledge of $X_1$ alone.
We have by no means accounted for all the factors that 
are responsible for variability in corn yield, but we have 
measured and reduced to precise terms the effect of three 
factors upon the yield of corn per acre in Kansas. 
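
The standard error of estimate, and the correction for the number of constants described in the accompanying footnote, can be sketched as below. The Python is illustrative and its names are assumptions. The correction factor (N − 1)/(N − m) is inferred from the figures printed in the footnote (17.0746 becoming 18.355) and should be treated as an assumption about the form of Ezekiel's adjustment.

```python
import math

N = 44             # number of observations
NUM_CONSTANTS = 4  # a and the three net regression coefficients

VAR_YIELD = 44.6878
COEFFS = [-0.46005, -1.41967, -0.74910]       # b12.34, b13.24, b14.23
MEAN_PRODUCTS = [-10.1263, -11.5671, -8.7213]  # p12, p13, p14

# Portion of the variance of yield accounted for by the regression.
explained = sum(b * p for b, p in zip(COEFFS, MEAN_PRODUCTS))

s_squared = VAR_YIELD - explained
s = math.sqrt(s_squared)

# Assumed form of the small-sample correction: multiply by (N - 1) / (N - m).
s_squared_corrected = s_squared * (N - 1) / (N - NUM_CONSTANTS)
s_corrected = math.sqrt(s_squared_corrected)

print(round(s_squared, 4), round(s, 2))
print(round(s_squared_corrected, 3), round(s_corrected, 2))
```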


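(Footnote 2, continued.)

$$\Sigma(d^2) = \Sigma(x_1^2) - b_{12.34}\Sigma(x_1x_2) - b_{13.24}\Sigma(x_1x_3) - b_{14.23}\Sigma(x_1x_4)$$

$$\frac{\Sigma(d^2)}{N} = \frac{\Sigma(x_1^2)}{N} - b_{12.34}\frac{\Sigma(x_1x_2)}{N} - b_{13.24}\frac{\Sigma(x_1x_3)}{N} - b_{14.23}\frac{\Sigma(x_1x_4)}{N}.$$

Since the variables refer to deviations from the means, we have

$$S_{1.234}^2 = \sigma_1^2 - b_{12.34}\,p_{12} - b_{13.24}\,p_{13} - b_{14.23}\,p_{14}.$$

See Appendix A for a general derivation of these relations.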

¹ For precise work, when the sample is small, allowance should be made in
computing S for the number of constants in the equation of regression. Since
there are four constants in the present equation, the 44 observations have but
40 degrees of freedom to deviate from the computed values. Denoting by $\bar{S}$
the corrected value of the standard error of estimate, and by m the number of
constants in the equation of regression, Ezekiel gives

$$\bar{S}^2 = S^2\left(\frac{N - 1}{N - m}\right).$$

Applying this correction to the present measurements, we have

$$\bar{S}_{1.234}^2 = 17.0746\left(\frac{43}{40}\right) = 18.355$$
$$\bar{S}_{1.234} = 4.28.$$
THE COEFFICIENT OF MULTIPLE CORRELATION

We have need now of our third measure, the abstract
coefficient of correlation. The value of this coefficient, as
we have seen, depends upon the relation between S and $\sigma$.
It may be computed in the present instance from the
formula

$$R_{1.234}^2 = 1 - \frac{S_{1.234}^2}{\sigma_1^2}.$$

When the relationship between a single dependent variable
and several independent variables is being studied, this
measure is termed the coefficient of multiple correlation
and is represented by the symbol R. The subscript to
the left of the period relates to the dependent variable,
while those to the right relate to the independent variables.
Substituting in this formula the equivalent of $S_{1.234}^2$ we have

$$R_{1.234}^2 = 1 - \frac{\sigma_1^2 - b_{12.34}\,p_{12} - b_{13.24}\,p_{13} - b_{14.23}\,p_{14}}{\sigma_1^2}$$

which reduces to¹

$$R_{1.234}^2 = \frac{b_{12.34}\,p_{12} + b_{13.24}\,p_{13} + b_{14.23}\,p_{14}}{\sigma_1^2}.$$

Inserting the proper values we have

$$R_{1.234}^2 = \frac{4.6586 + 16.4215 + 6.5331}{44.6878}$$

$$R_{1.234}^2 = .6179$$
$$R_{1.234} = .786.$$

For the same reason that estimates of ρ computed from
samples must be corrected by making allowance for the
number of constants in the regression equation, correction
must be made in R. For if the number of constants is
equal to the number of observations, R will necessarily
equal 1. Using $\bar{R}$ to denote the corrected coefficient of
multiple correlation and m to denote the number of con-
stants in the equation of regression, Ezekiel gives

\[ \bar{R}^2 = 1 - (1 - R^2)\left(\frac{N - 1}{N - m}\right). \]

In the present example

\[ \bar{R}^2_{1.234} = 1 - (1 - .6179)\left(\frac{43}{40}\right) = .5892 \]

\[ \bar{R}_{1.234} = .768. \]

¹ The coefficient of multiple correlation may also be derived from the general
formula, which refers to an origin at zero on the original scales. This general
formula is
$R^2_{1.234 \ldots n} = \ldots$ [remainder of the formula is illegible in the source]

In later references to this illustration the uncorrected 
measure is used, though it is to be understood that the 
corrected measure provides a somewhat closer approximation 
to the true R than does the uncorrected coefficient. 
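
A short Python sketch may make the correction concrete. It assumes that Ezekiel's adjustment takes the form of the factor (N - 1)/(N - m) applied to the unexplained share of the variance, an assumption that reproduces the printed figures .5892 and .768, and likewise the corrected standard error of estimate 4.28. The function names are hypothetical.

```python
import math

def corrected_r_squared(r_squared, n_obs, n_constants):
    """Corrected R^2, assuming the factor (N - 1) / (N - m)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_constants)

def corrected_s_squared(s_squared, n_obs, n_constants):
    """Corrected squared standard error of estimate, same assumed factor."""
    return s_squared * (n_obs - 1) / (n_obs - n_constants)

# Corn yield example: 44 observations, 4 constants in the equation.
r2_bar = corrected_r_squared(0.6179, 44, 4)
s2_bar = corrected_s_squared(17.0746, 44, 4)

print(f"corrected R squared = {r2_bar:.4f}, corrected R = {math.sqrt(r2_bar):.3f}")
print(f"corrected S squared = {s2_bar:.3f}, corrected S = {math.sqrt(s2_bar):.2f}")
```

With 44 observations the correction is small. It grows rapidly as the number of constants approaches the number of observations.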

The coefficient of multiple correlation is an index of the 
degree of relationship between a single dependent variable 
and a number of independent variables, in combination. 
It measures the degree to which variations in the dependent 
variable are related to the combined action of the other 
factors. Its significance may be clearer if all the independent 
variables are looked upon as constituting a single independ- 
ent series. The coefficient is then seen to be a measure 
of the relationship between the dependent variable and the 
independent series, which is precisely what the coefficient 
of correlation is in the simpler case of two variables. In 
the multiple case the independent series has several com-
ponent elements, but this fact does not alter the essential
significance of the coefficient. No positive or negative sign
is attached to R, it should be noted. In the present instance 
all of the independent variables are negatively correlated 
with corn yield, and a negative sign might be attached. 
The correlation could be positive, however, for some of
the independent variables, and negative for others. Because
of this fact, R is always given without sign. The signs of 
the constants in the equation of net regression show which 
of the independent variables are positively correlated and 
which are negatively correlated with the dependent variable. 

The sampling error of the coefficient of multiple correlation 
may be estimated from the formula

\[ \sigma_R = \frac{1 - R^2}{\sqrt{N - m}} \]

where m is the number of constants in the equation of
regression. A more accurate test of the significance of R
may be applied with reference to Fisher's z-table, discussed
in Chapter XV. The deviations of actual from computed
values serve as a yardstick for testing the variability in
$X_1$ that is attributable to $X_2$, $X_3$, and $X_4$, as the relationship
is defined by the equation of regression. In common with
other correlation problems, this one reduces to a comparison
of variances.

The sum of the squares of the deviations of the observed
values of $X_1$ from the computed values is 751.2824. The
sum of the squares of the deviations of the computed
values of $X_1$ from the mean value of $X_1$ is 1,214.9808.
Since there are 44 observations, and since the equation
of regression contains four constants, there are 40 degrees
of freedom in the deviations from the regression function.
The three coefficients of regression (other than the con-
stant a) give three degrees of freedom to variation among the
computed values of $X_1$. The test takes the following form.


Nature of variability          Degrees of   Sum of squared   Mean        log_e of
                               freedom      deviations       square      mean square

Variation among computed
  values                            3        1,214.9808      404.9936      6.0039
Deviation of observed from
  computed values                  40          751.2824       18.7821      2.9329

Total                              43        1,966.2632      Difference =  3.0710
                                                             z = 1/2 (difference) = 1.5355

For $n_1 = 3$, $n_2 = 40$, the 1 per cent value of z, as derived
from Appendix Table VI, is .7308. The present value is
greatly in excess of this. The variation in $X_1$ attributable
to the influence of $X_2$, $X_3$, and $X_4$ is clearly greater than
the residual variability here used as the yardstick. The
measure of correlation, R, is unquestionably significant.
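
The test may be retraced numerically. The following Python sketch uses only the sums of squares and degrees of freedom quoted in the text, and forms z as one half the difference of the natural logarithms of the two mean squares. The variable names are our own, and the 1 per cent point .7308 is taken from the text rather than computed.

```python
import math

# Sums of squares and degrees of freedom quoted in the text.
ss_computed, df_computed = 1214.9808, 3    # variation among computed values
ss_residual, df_residual = 751.2824, 40    # observed minus computed values

ms_computed = ss_computed / df_computed
ms_residual = ss_residual / df_residual

# Fisher's z: half the difference of the natural logs of the mean squares.
z = 0.5 * (math.log(ms_computed) - math.log(ms_residual))

# The same two sums of squares also reproduce R^2 as explained / total.
r_squared = ss_computed / (ss_computed + ss_residual)

ONE_PER_CENT_Z = 0.7308  # tabled value for n1 = 3, n2 = 40, as given in the text

print(f"mean squares: {ms_computed:.4f} and {ms_residual:.4f}")
print(f"z = {z:.4f}; exceeds 1 per cent point: {z > ONE_PER_CENT_Z}")
print(f"R squared from sums of squares = {r_squared:.4f}")
```

That the ratio of explained to total sum of squares returns .6179 illustrates the remark that the problem reduces to a comparison of variances.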

COMPARISON OF MEASURES OF RELATIONSHIP

The degree to which our knowledge of the causes of 
variation in corn yield has been improved and the reliability 
of our estimates increased by taking account of the various 
factors in combination may be more readily appreciated 
if we bring together the various measures secured in the 
course of this analysis. 

Table 129

A Comparison of Certain Measures Pertaining to the Corn
Yield in Kansas

                                          Measure of          Coefficient
Basis of estimate                         reliability         of
                                          of estimate         correlation

Arithmetic mean of X1 = 19.63             σ1     = 6.68
X1 = 100.35 - 1.096X2                     S12    = 5.80       r12    = - .4984
X1 = 166.07 - 1.866X3                     S13    = 4.81       r13    = - .6948
X1 = 119.45 - 1.288X4                     S14    = 5.78       r14    = - .5013
X1 = 222.99 - 0.460X2 - 1.420X3
       - 0.749X4                          S1.234 = 4.13       R1.234 =   .7861

The value of S might be further reduced and the value 
of R correspondingly increased by bringing into the analysis 
other factors, such as rainfall during the growing months. 
The method which has been explained may be extended 
to cover any number of variables, one equation being added 
to the set of simultaneous equations for each additional 
variable introduced.

THE METHOD OF MULTIPLE CORRELATION VALID FOR LINEAR
RELATIONSHIPS

One important condition has not been emphasized in the
course of the preceding discussion. The validity of this
method of multiple correlation depends upon the existence
of a linear relationship between each pair of variables.
Thus with four variables there were six pairings possible
(i.e., six mean products were computed). If there had been
a material departure from linearity in any of these six
relationships the significance of the results would have been
decreased. There would be no fallacy involved in the use
of the equation under these conditions, but it would not
furnish as good a basis for estimates as one which took
account of the true relationship. In such a case the values
of S and R would indicate that the estimates based upon
the assumption of linear relationship were not very reliable.¹

AN APPLICATION OF THE METHOD 

Let us illustrate the use of the estimating equation.
In the year 1933 the average June temperature in Kansas
was 80.5° F., the average July temperature was 81.4° F.,
and the average August temperature was 76.8° F. What
was the probable corn yield per acre? Substituting these
values for $X_2$, $X_3$, and $X_4$ in the equation

\[ X_1 = 222.99 - 0.460X_2 - 1.420X_3 - 0.749X_4 \]

we have

\[ X_1 = 222.99 - (0.460 \times 80.5) - (1.420 \times 81.4) - (0.749 \times 76.8) \]

\[ X_1 = 12.85. \]

The estimated yield for 1933 is thus 12.85 bushels per acre.

¹ An approach to problems of multiple correlation when the relationship
between the subordinate series is non-linear is explained by M. J. B. Ezekiel in
the Journal of the American Statistical Association, Vol. XIX, N. S. No. 148,
1924, and in his book Methods of Correlation Analysis.

What are the limits within which we may expect the
actual yield to fall, with respect to this estimate? The value
of $S_{1.234}$ is 4.13 bushels. This means that the odds are
68 out of 100 that the actual yield will be within the
limits 8.72 bushels (i.e., 12.85 - 4.13) and 16.98 bushels
(i.e., 12.85 + 4.13). The actual yield in 1933 was 11.5
bushels per acre.

In this illustration we have used one of the years in-
cluded in the study. The same method would be employed
in making an estimate for a future year. (Additional ele-
ments of uncertainty are introduced, of course, whenever
results secured for one period are applied to another time
period.) Thus, from the temperatures in 1936 (76.7° in
June, 85.5° in July, and 84.4° in August), an estimate
of 3.1 bushels per acre is yielded by the regression equation
employed above. This was a summer of exceptional heat
and drought. The actual yield was 4.0 bushels per acre.
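
The two estimates may be reproduced with a small Python function. This is a sketch only: the function name `estimated_yield` is hypothetical, while the constants are those of the estimating equation and the standard error of estimate quoted in the text.

```python
STANDARD_ERROR = 4.13  # S_1.234, in bushels per acre

def estimated_yield(june, july, august):
    """Corn yield (bushels per acre) from mean monthly temperatures in deg F."""
    return 222.99 - 0.460 * june - 1.420 * july - 0.749 * august

yield_1933 = estimated_yield(80.5, 81.4, 76.8)
yield_1936 = estimated_yield(76.7, 85.5, 84.4)

# Limits of one standard error on either side of the 1933 estimate.
lower_1933 = yield_1933 - STANDARD_ERROR
upper_1933 = yield_1933 + STANDARD_ERROR

print(f"1933: {yield_1933:.2f} bushels, limits {lower_1933:.2f} to {upper_1933:.2f}")
print(f"1936: {yield_1936:.1f} bushels")
```

The actual yields of 11.5 bushels in 1933 and 4.0 bushels in 1936 both fall within one standard error of the corresponding estimates.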

The Meaning of Partial or Net Correlation 

In the preceding section we have sought to determine 
the degree to which corn yield in Kansas is affected by the 
temperature in June, July, and August, treating the three 
independent variables in combination. Our aim has been 
to measure their combined effect upon corn yield. There 
is a related problem, which in many studies may be of 
major importance. This is the determination of the rela- 
tionship between a dependent variable and a single indepen- 
dent variable when all other factors included in the study are 
held constant. Concretely, what would be the effect upon 
corn yield of variations in July temperature, if June tempera-
ture and August temperature could be held constant? This
is the problem of net or partial correlation.

It is obvious that if a method could be developed by 
which two variables could be isolated for separate study, it 
would add immeasurably to the analytical powers of the 
economist, and of social scientists in general. It would give
to the student in these fields that power to eliminate irrele-
vant influences and to concentrate his attention upon a single
factor which is possessed by the chemist, for example. In
studying the effect of one element upon another the chemist
seeks to eliminate all other elements, and the effectiveness of
his analysis depends in large part upon the degree to which it
is possible thus to isolate the object of immediate interest.

It is not generally possible in economic analysis to 
eliminate all but one of the factors responsible for variations 
in a given series. The direct and indirect causes of a given
economic phenomenon are too numerous and too complicated
in their interaction for the economist ever to hope to emulate
the chemist in reducing his problem to terms of but two
variables. But, within certain limits, the statistician is
able to employ the method of the physical scientist, in
holding constant certain factors while the effects of varia- 
tions in another are studied. The methods which make this 
possible are among the most powerful of the instruments 
which the student of the social sciences possesses. 

The method of partial correlation may be explained with 
reference to the problem of corn yield in Kansas. Our
object is to determine the net correlation between corn 
yield and the temperature in each of the three months 
for which the average temperature is given. 


DISTINCTION BETWEEN PARTIAL AND SIMPLE CORRELATION

It is important to distinguish between this problem and
that faced in the ordinary measurement of relationship
between two variables. We have already secured, as a
description of the average relationship between corn yield
and July temperature, the equation

\[ X_1 = 166.07 - 1.866X_3 \]

with

\[ S_{13} = 4.81 \]

and

\[ r_{13} = -.6948. \]

These measures describe the relationship in question when
all other factors are ignored. Those factors are not taken
account of; they are merely neglected. It is as though the chemist,
in studying the reaction of one element to another, used
a test tube containing various impurities, which he made
no attempt to remove. The economist cannot, in general, 
locate and remove all the “impurities” in his problem, but 
he should recognize that his measures relate to such 
uncorrected data. 

The Method of Partial Correlation

In seeking to determine the net correlation between corn 
yield and July temperature we attempt to secure a measure 
of the correlation which would prevail if other factors might 
be held constant. We shall take full account of the other 
factors we have studied, but we shall try to secure a meas- 
ure influenced only by fluctuations in July temperature, in 
relation to corn yield. 

One possible method of accomplishing this end may be 
suggested. If we possessed data covering a very long
period we might be able to pick out a number of years
during which the average temperatures in June and August
remained unchanged. Let us say that we could find thirty
years in all, during each of which the June temperature
averaged 74° and the August temperature 78°. Corn yield
and July temperature varied during these years. The re-
lationship between July temperature and corn yield might
now be measured, and it would be certain that the results
would not be affected by the presence of fluctuations in 
June temperature and August temperature. Unfortunately, 
this method of holding certain factors constant cannot be 
employed. The data are too limited and too varied, in 
general, to enable us to pick from among them such figures 
as are appropriate to our purpose. Other methods of 
arriving at the same end are available, however. 

As a first step, let us derive the equation defining the
relationship between corn yield as dependent variable and
June temperature and August temperature as independent
variables. This will be of the form

\[ X_1 = a + b_{12.4}X_2 + b_{14.2}X_4. \]

We solve for the constants exactly as in the preceding
example, except that variables $X_1$, $X_2$, and $X_4$ only are
employed. The desired equation is

\[ X_1 = 160.97 - 0.856X_2 - 1.010X_4. \]

We may determine the value of the standard error of
estimate from the relation

\[ S^2_{1.24} = \sigma_1^2 - b_{12.4}\,p_{12} - b_{14.2}\,p_{14}. \]

We secure

\[ S^2_{1.24} = 27.2112 \]

\[ S_{1.24} = 5.22. \]

If corn yield per acre is estimated from June temperature 
and August temperature the standard error of estimate, 
or the standard deviation of the remaining variability, is 
5.22 bushels. But we know that if corn yield is estimated 
from June, July, and August temperature, the standard 
error of estimate, or the standard deviation of the remaining 
variability, is 4.13 bushels. The measure of remaining or 
“unexplained” variability is reduced from 5.22 to 4.13 
by the addition of July temperature ($X_3$) to the estimating
equation, after account has already been taken of the
influence of June temperature ($X_2$) and August temperature
($X_4$). The difference between these two measures may be
taken to represent a relationship between $X_1$ and $X_3$ which
is not affected by variations in $X_2$ and $X_4$.

We have seen that the degree of correlation between a
dependent variable ($X_1$) and an independent variable ($X_3$)
may be defined by the relation

\[ r^2_{13} = 1 - \frac{S^2_{13}}{\sigma_1^2}. \]

Here the denominator of the fraction in the right-hand
member defines the original variability of $X_1$, while the
numerator of that fraction defines the variability of $X_1$
after account has been taken of the influence of $X_3$. In
the present problem we have

\[ r^2_{13} = 1 - \frac{23.1134}{44.6878} = .4828 \]

\[ r_{13} = -.695. \]

The coefficient of correlation is given the sign of $b_{13}$, the
coefficient of regression.

In exactly the same way, we may say that the net
correlation between $X_1$ and $X_3$, when the relationship is
not affected by fluctuations in $X_2$ and $X_4$, is defined by the
relation

\[ r^2_{13.24} = 1 - \frac{S^2_{1.234}}{S^2_{1.24}}. \]

Here the denominator of the fraction in the right-hand
member defines the variability remaining in $X_1$ after account
has been taken of the influence of $X_2$ and $X_4$, while the
numerator defines the variability remaining in $X_1$ after
account has been taken of the influence of $X_2$, $X_3$, and $X_4$.
Numerator and denominator differ only because of the presence
of correlation between $X_1$ and $X_3$ that is incremental to any
correlation that may exist between $X_1$ on the one hand and
$X_2$ and $X_4$ on the other. If the equation


\[ X_1 = 222.99 - 0.460X_2 - 1.420X_3 - 0.749X_4 \]


gives estimates no more reliable than those derived from 
the equation 


\[ X_1 = 160.97 - 0.856X_2 - 1.010X_4 \]

then numerator ($S^2_{1.234}$) and denominator ($S^2_{1.24}$) of the above
fraction will be equal, and $r^2_{13.24}$ will have a value of zero.
But if the equation containing $X_2$, $X_3$, and $X_4$ as independent
variables gives better estimates than does the equation
containing only $X_2$ and $X_4$, the numerator will be smaller
than the denominator, and $r^2_{13.24}$ will have a value other
than zero. If the estimates based upon the three independ-
ent variables are in exact agreement with observed yields,
$S^2_{1.234}$ will be equal to zero, and $r^2_{13.24}$ will have a value of
unity.

Employing the values derived above, we have

\[ r^2_{13.24} = 1 - \frac{17.0746}{27.2112} = .3725 \]

\[ r_{13.24} = -.610. \]


The coefficient of net correlation is negative, having
the same sign as the coefficient of net regression.

The quantity $r_{13.24}$ measures the degree of correlation
between $X_1$ and $X_3$ when neither one is affected by variations
in $X_2$ and $X_4$. It may be thought of, equally well, as a
measure of the degree to which errors in estimating $X_1$
are reduced when use is made of $X_3$, after full account has
already been taken of the influence of $X_2$ and $X_4$ on $X_1$.
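
Both the simple and the net coefficient are thus obtained from a ratio of two residual variances. The Python sketch below illustrates this with the figures quoted in the text. The function name is our own, and the negative sign is attached by hand, following the rule that the coefficient takes the sign of the corresponding regression coefficient.

```python
import math

def squared_correlation(variance_after, variance_before):
    """1 minus the ratio of remaining variability to prior variability."""
    return 1.0 - variance_after / variance_before

# Simple correlation of yield with July temperature:
# variability left after using X3, against the original variance of X1.
r2_13 = squared_correlation(23.1134, 44.6878)

# Net correlation with June and August temperature held constant:
# S^2 with X2, X3, X4 in the equation, against S^2 with X2 and X4 only.
r2_13_24 = squared_correlation(17.0746, 27.2112)

# Both regression coefficients on X3 are negative, so both roots are negative.
r_13 = -math.sqrt(r2_13)
r_13_24 = -math.sqrt(r2_13_24)

print(f"r13    = {r_13:.3f}  (r squared {r2_13:.4f})")
print(f"r13.24 = {r_13_24:.3f}  (r squared {r2_13_24:.4f})")
```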

The meaning of the symbols employed in the above 
demonstration should be clear from the context. As in 
the coefficients of net regression, the first of the subscripts 
to the left of the point (the primary subscripts) refers to 
the dependent variable; the second of the primary sub- 
scripts refers to the single independent variable to which 
the measure of net correlation applies specifically. The 
subscripts to the right of the point (the secondary sub-
scripts) indicate the variables which are held constant for
the purpose of the particular comparison being made. The
number held constant is two in the present case, though
it might be one, or any other number. Thus the general
formula for the coefficient of net correlation between vari-
ables $X_1$ and $X_3$ would be

\[ r^2_{13.2456 \ldots n} = 1 - \frac{S^2_{1.23456 \ldots n}}{S^2_{1.2456 \ldots n}}. \]

The variable that is present in the numerator and absent
in the denominator is the particular independent variable
that is being paired with the dependent variable for the
purpose of measuring net relationship.

The coefficients of net correlation between Xi and each 
of the other independent variables may be derived in similar 
fashion. Thus

\[ r^2_{12.34} = 1 - \frac{S^2_{1.234}}{S^2_{1.34}} \]

\[ r^2_{14.23} = 1 - \frac{S^2_{1.234}}{S^2_{1.23}}. \]

In each case the difference between numerator and denomi- 
nator of the fraction in the right-hand member measures 
the net reduction in the variability of Xi which is associated 
with a relationship between Xi and a single independent 
variable, account having already been taken of the influence 
of two other variables. 

It is clear that such measurements as these are net only 
with respect to the variables represented by the secondary 
subscripts. The coefficient $r_{12.34}$ measures the degree of
relation between $X_1$ and $X_2$ when $X_3$ and $X_4$ are held con-
stant. There may be many other factors affecting $X_1$ and
$X_2$; the disturbing influences of such factors have not been
eliminated. These other factors still muddy the water of
analysis. Ignoring them is not the same as holding them
constant. Only by direct measurement and inclusion in
the study, as was done with $X_3$ and $X_4$, may the influence
of additional variables be effectively eliminated.

Another Method of Computing Coefficients of
Partial Correlation 

Obviously a whole series of coefficients of net correlation 
may be computed in dealing with a number of variables. 
In deriving a number of such measurements a method may 
be utilized which differs somewhat from that employed 
above, and which has certain advantages in the way of 
systematic arrangement.

A simple coefficient of correlation relating to but two
variables is termed a coefficient of zero order. Such coefficients
are represented by symbols of the type $r_{12}$, $r_{13}$, etc. Coeffi-
cients of net correlation which relate to two variables,
while a single additional variable is held constant, are
termed coefficients of the first order, and are represented by
symbols such as $r_{12.3}$, $r_{24.3}$, etc. Similarly, we may have coeffi-
cients of the second, third, fourth, or nth order, depending
upon the number of variables held constant while the relation- 
ship between a single dependent and a single independent 
variable is being measured. 

It is possible to derive each coefficient of partial correla- 
tion from those of the next lower order. Thus a coefficient 
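of the first order may be derived from the relation

\[ r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{(1 - r^2_{13})^{1/2}(1 - r^2_{23})^{1/2}}. \]

For a coefficient of the second order

\[ r_{12.34} = \frac{r_{12.3} - r_{14.3}\,r_{24.3}}{(1 - r^2_{14.3})^{1/2}(1 - r^2_{24.3})^{1/2}}. \]

As a general equation for a coefficient of net correlation
of any order,¹ we have

\[ r_{12.345 \ldots n} = \frac{r_{12.345 \ldots (n-1)} - r_{1n.345 \ldots (n-1)}\,r_{2n.345 \ldots (n-1)}}{(1 - r^2_{1n.345 \ldots (n-1)})^{1/2}(1 - r^2_{2n.345 \ldots (n-1)})^{1/2}}. \]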

Thus it is possible, starting with the zero order coefficients
of correlation, to compute all higher order coefficients
successively. The mere arithmetic of calculation would be
laborious, but certain prepared tables reduce those computa-
tions to a minimum.² The method may be illustrated, using
the data of the preceding problem.

¹ It will be noted that in an equation used in computing a coefficient of
partial correlation the three r's in the numerator of the right-hand member
have the same secondary subscripts, and that these secondary subscripts are
one less in number than the secondary subscripts of the left-hand member;
that the first r in the numerator has the same primary subscripts as the left-
hand member; that the second and third r's in the numerator have primary
subscripts composed of one of the primary subscripts of the left-hand member
plus the missing secondary subscript; that the two r's in the denominator are
the same as the second and third r's in the numerator.

In the present case we require three coefficients of the
second order, $r_{12.34}$, $r_{13.24}$, and $r_{14.23}$. These will serve as
measures of the net correlation between corn yield and
temperature in each of the three critical months. The
formula from which the first of these measures may be
computed was given above. For the second, we have

\[ r_{13.24} = \frac{r_{13.2} - r_{14.2}\,r_{34.2}}{(1 - r^2_{14.2})^{1/2}(1 - r^2_{34.2})^{1/2}} \]

and for the third

\[ r_{14.23} = \frac{r_{14.2} - r_{13.2}\,r_{43.2}}{(1 - r^2_{13.2})^{1/2}(1 - r^2_{43.2})^{1/2}}. \]

But each of these values may be derived from a slightly
different grouping of first order coefficients. We may use
the three formulas

\[ r_{12.34} = \frac{r_{12.4} - r_{13.4}\,r_{23.4}}{(1 - r^2_{13.4})^{1/2}(1 - r^2_{23.4})^{1/2}} \]

\[ r_{13.24} = \frac{r_{13.4} - r_{12.4}\,r_{32.4}}{(1 - r^2_{12.4})^{1/2}(1 - r^2_{32.4})^{1/2}} \]

\[ r_{14.23} = \frac{r_{14.3} - r_{12.3}\,r_{42.3}}{(1 - r^2_{12.3})^{1/2}(1 - r^2_{42.3})^{1/2}}. \]

By employing both methods in computing each second
order coefficient a check upon the calculations is afforded.
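
The successive derivation of higher order coefficients lends itself to a recursive statement. The Python sketch below is one possible arrangement, not the tabular procedure of the text: the function `partial_r` and the dictionary `ZERO_ORDER` are hypothetical names, and the six zero order coefficients are those given for the corn yield problem. Removing the held-constant variables in a different sequence corresponds to the alternative grouping of first order coefficients, and so supplies the check described above.

```python
import math

# Zero order coefficients for the Kansas corn yield problem (from the text).
ZERO_ORDER = {
    frozenset((1, 2)): -0.4984,
    frozenset((1, 3)): -0.6948,
    frozenset((1, 4)): -0.5013,
    frozenset((2, 3)): 0.3938,
    frozenset((2, 4)): 0.2775,
    frozenset((3, 4)): 0.2868,
}

def partial_r(i, j, held=()):
    """Net correlation of X_i and X_j with the variables in `held` constant.

    The last variable in `held` is removed by the general formula, using
    coefficients of the next lower order computed on the remaining ones.
    """
    if not held:
        return ZERO_ORDER[frozenset((i, j))]
    *rest, k = held
    rest = tuple(rest)
    r_ij = partial_r(i, j, rest)
    r_ik = partial_r(i, k, rest)
    r_jk = partial_r(j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik**2) * (1 - r_jk**2))

# Each second order coefficient computed by two routes, as a check.
for j, others in ((2, (3, 4)), (3, (2, 4)), (4, (2, 3))):
    first_route = partial_r(1, j, others)
    second_route = partial_r(1, j, others[::-1])
    print(f"r1{j}.{others[0]}{others[1]} = {first_route:+.4f}  (check {second_route:+.4f})")
```

The two routes agree apart from rounding in the machine arithmetic, as they must, since both express the same quantity.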

COMPUTATION OF FIRST ORDER COEFFICIENTS

The second order coefficients cannot be computed until
all necessary first order coefficients have been secured.
The necessary equations, of the type

\[ r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{(1 - r^2_{13})^{1/2}(1 - r^2_{23})^{1/2}}, \]

may be constructed from the general formula for coefficients
of partial correlation. Since several of these values must
be computed, a systematic arrangement should be employed.

² J. R. Miner, Tables of √(1 - r²) and 1 - r² for Use in Partial Correlation and
in Trigonometry, Johns Hopkins Press, Baltimore, Md., 1922.

Table 130

Illustrating the Computation of First Order Coefficients of Partial
Correlation

(Kansas corn yield and temperature)

   r, zero order                     Product                              r, first order
 Sub-     Coef-                      term of     Whole      Denomi-     Sub-      Coef-
 script   ficient     √(1 - r²)      numerator   numerator  nator       script    ficient

  12      -.4984                     -.2736      -.2248     .6611       12.3      -.3400
  13      -.6948       .7192
  23      +.3938       .9192

  14      -.5013                     -.1993      -.3020     .6890       14.3      -.4383
  13      -.6948       .7192
  43      +.2868       .9580

  24      +.2775                     +.1129      +.1646     .8806       24.3      +.1869
  23      +.3938       .9192
  43      +.2868       .9580

  13      -.6948                     -.1963      -.4985     .7969       13.2      -.6255
  12      -.4984       .8669
  32      +.3938       .9192

  14      -.5013                     -.1383      -.3630     .8329       14.2      -.4358
  12      -.4984       .8670
  42      +.2775       .9607

  34      +.2868                     +.1093      +.1775     .8831       34.2      +.2010
  32      +.3938       .9192
  42      +.2775       .9607

  12      -.4984                     -.1391      -.3593     .8313       12.4      -.4322
  14      -.5013       .8653
  24      +.2775       .9607

  13      -.6948                     -.1438      -.5510     .8290       13.4      -.6647
  14      -.5013       .8653
  34      +.2868       .9580

  23      +.3938                     +.0796      +.3142     .9204       23.4      +.3414
  24      +.2775       .9607
  34      +.2868       .9580

The procedure in computing each first order coefficient
is simple. Three zero order coefficients are necessary for 
each calculation. These should be arranged in the table 
in the order in which they occur in the numerator of the 
fraction from which the required coefficient is to be com- 
puted. The numerator of this fraction is secured by sub- 
tracting from the first zero order coefficient the product of 
the other two. This product term appears in one column 
of the table. The denominator of the fraction is the product
of two terms of the type $\sqrt{1 - r^2}$, derived from the second
and third coefficients in each group of three. The tabular
arrangement of Table 130 on page 557 permits these com-
putations to be carried forward systematically.

The coefficient $r_{23.4}$ is, of course, identical with $r_{32.4}$;
$r_{34.2}$ is identical with $r_{43.2}$, etc. It is unnecessary to duplicate
the work of computation with respect to these measures.
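
The worksheet columns of Table 130 may be generated mechanically. The Python sketch below follows the order of operations just described: product term, whole numerator, denominator, quotient. It is an illustration under our own naming (`worksheet_row`, `GROUPS`), with the zero order coefficients taken from the text.

```python
import math

ZERO_ORDER = {"12": -0.4984, "13": -0.6948, "14": -0.5013,
              "23": 0.3938, "24": 0.2775, "34": 0.2868}

def zero_order(a, b):
    """Look up r_ab; the order of the two subscripts is immaterial."""
    return ZERO_ORDER["".join(sorted(a + b))]

def worksheet_row(a, b, c):
    """Columns of the worksheet for r_ab.c, in the order used in the table."""
    r_ab, r_ac, r_bc = zero_order(a, b), zero_order(a, c), zero_order(b, c)
    product = r_ac * r_bc                       # product term of numerator
    numerator = r_ab - product                  # whole numerator
    denominator = math.sqrt(1 - r_ac**2) * math.sqrt(1 - r_bc**2)
    return product, numerator, denominator, numerator / denominator

# The nine groups of three, in the order in which they appear in the table.
GROUPS = [("1", "2", "3"), ("1", "4", "3"), ("2", "4", "3"),
          ("1", "3", "2"), ("1", "4", "2"), ("3", "4", "2"),
          ("1", "2", "4"), ("1", "3", "4"), ("2", "3", "4")]

first_order = {}
for a, b, c in GROUPS:
    product, numerator, denominator, coefficient = worksheet_row(a, b, c)
    first_order[f"{a}{b}.{c}"] = coefficient
    print(f"r{a}{b}.{c}: product {product:+.4f}  numerator {numerator:+.4f}  "
          f"denominator {denominator:.4f}  coefficient {coefficient:+.4f}")
```

Small differences in the last decimal place from the printed table are to be expected, since the table was built up from rounded intermediate figures.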

COMPUTATION OF SECOND ORDER COEFFICIENTS

From these first order coefficients the three required
second order coefficients may be secured by methods analo-
gous to those employed above. The computations are
shown in Table 131. As a check upon the calculations each
required measure is computed from two different combina-
tions of the first order coefficients.

The value of $r_{13.24}$, it will be noted, is the same as that
derived from the relation between $S_{1.24}$ and $S_{1.234}$.

The meaning of such coefficients as these was explained
in the earlier section dealing with this problem. The follow-
ing summary of results reveals the gain in knowledge which
has resulted from the above analysis.

r12 = - .4984        r12.34 = - .2923
r13 = - .6948        r13.24 = - .6101
r14 = - .5013        r14.23 = - .4057

It is clear that the net effect of June temperature upon
corn yield is distinctly less than was indicated by the simple
correlation. This is so because there is a positive correlation
between temperature in June and temperature in July and
August, so that the crude correlation of two variables
alone shows June temperature as more important than it
really is. For the same reason, all the net coefficients are
less than the simple coefficients, though it is still apparent
that July temperature is far more important, in relation
to corn yield, than the temperature in either of the other
months.

Table 131

Illustrating the Computation of Second Order Coefficients of Partial
Correlation

(Kansas corn yield and temperature)

   r, first order                    Product                              r, second order
 Sub-     Coef-                      term of     Whole      Denomi-     Sub-      Coef-
 script   ficient     √(1 - r²)      numerator   numerator  nator       script    ficient

  12.3    -.3400                     -.0819      -.2581     .8830       12.34     -.2923
  14.3    -.4383       .8988
  24.3    +.1869       .9824

  13.2    -.6255                     -.0876      -.5379     .8816       13.24     -.6101
  14.2    -.4358       .9000
  34.2    +.2010       .9796

  14.2    -.4358                     -.1257      -.3101     .7643       14.23     -.4057
  13.2    -.6255       .7802
  43.2    +.2010       .9796

  12.4    -.4322                     -.2269      -.2053     .7022       12.34     -.2924
  13.4    -.6647       .7471
  23.4    +.3414       .9399

  13.4    -.6647                     -.1476      -.5171     .8476       13.24     -.6101
  12.4    -.4322       .9018
  32.4    +.3414       .9399

  14.3    -.4383                     -.0635      -.3748     .9238       14.23     -.4057
  12.3    -.3400       .9404
  42.3    +.1869       .9824

The coefficients of net correlation are net, of course,
only with respect to the variables actually taken account 
of, and held constant. Thus there may be other factors, 
such as rainfall in June, July, or August, which affect 
corn yield and which are correlated with the temperature 
during these months. Were these included the various 
coefficients of net correlation might have different values 
from those given. 

The sampling error of a coefficient of partial correlation
may be estimated from the same general relations that
hold for zero order coefficients, except that the factor N - 1
must be further reduced by the number of variables held
constant. Thus for $r_{12.34}$ we have

\[ \sigma_{r_{12.34}} = \frac{1 - r^2_{12.34}}{\sqrt{N - 3}}. \]
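
A brief Python sketch may show the size of these sampling errors in the present problem. It rests on an assumption that the passage implies but does not restate here, namely that the zero order sampling error has the form (1 - r²)/√(N - 1); the factor N - 1 is then reduced by the number of variables held constant. The function name is hypothetical.

```python
import math

def sampling_error(r, n_obs, n_held_constant=0):
    """Assumed form: (1 - r^2) / sqrt(N - 1 - number held constant)."""
    return (1 - r**2) / math.sqrt(n_obs - 1 - n_held_constant)

N_OBS = 44  # observations in the corn yield study

# Second order coefficients from the summary of results; two variables held.
for label, r in (("r12.34", -0.2923), ("r13.24", -0.6101), ("r14.23", -0.4057)):
    error = sampling_error(r, N_OBS, n_held_constant=2)
    print(f"{label} = {r:+.4f}, sampling error = {error:.4f}")

error_12_34 = sampling_error(-0.2923, N_OBS, n_held_constant=2)
```

On this reckoning the net coefficient for June temperature is only about twice its sampling error, while that for July temperature is some six times its own.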

A MEASURE OF VARIABILITY

Having these coefficients of net correlation, another 
measure of some importance may be computed. This is 
a measure of the variability of a single character while a 
number of related variables are held constant. Thus the 
question might arise: If we could hold constant the tem- 
perature in Kansas in June, July, and August, what would 
be the variability of the corn yield? In other words : If we 
could eliminate such variability in corn yield as is due to 
variability in temperature, what fluctuations would remain 
in the yield of corn? This measure of variability is repre-
sented by the symbol $\sigma_{1.234 \ldots n}$. It is termed the standard
deviation of order n.

This measure may be computed from the general equation

\[ \sigma^2_{1.234 \ldots n} = \sigma_1^2 (1 - r^2_{12})(1 - r^2_{13.2})(1 - r^2_{14.23}) \cdots (1 - r^2_{1n.23 \ldots (n-1)}). \]

Applying this formula to the results of the study of corn
yield, we have

\[ \sigma^2_{1.234} = 44.6878\,[1 - (-.4984)^2][1 - (-.6255)^2][1 - (-.4057)^2] \]

\[ \sigma^2_{1.234} = 17.0797 \]

\[ \sigma_{1.234} = 4.13. \]

Referring back to the discussion of this problem we find
that the values of $\sigma_{1.234}$ and $S_{1.234}$ are identical. That is,
the standard deviation of variable $X_1$, when variables $X_2$,
$X_3$, and $X_4$ are held constant, is merely the standard devia-
tion of observed values from computed values of $X_1$. It
is the standard error of estimate, when estimates are based
upon the factors $X_2$, $X_3$, and $X_4$. The reason for this is
obvious. The variability of the original series is reduced
to the extent that estimates based upon the equation of
relationship approximate the actual values. The variability
which remains is due to differences between these estimates
and the actual values. But these differences are merely
the residual deviations, from which S is computed. A re-
alization of the identity of these two measures may assist
in making their meaning clear.

Since $\sigma_{1.234}$ and $S_{1.234}$ are identical, the coefficient of
multiple correlation, $R_{1.234}$, may be computed from the
equation

\[ R^2_{1.234} = 1 - \frac{\sigma^2_{1.234}}{\sigma_1^2} \]

or, using the formula for $\sigma^2_{1.234 \ldots n}$, from the equation

\[ 1 - R^2_{1.23 \ldots n} = (1 - r^2_{12})(1 - r^2_{13.2})(1 - r^2_{14.23}) \cdots (1 - r^2_{1n.23 \ldots (n-1)}). \]
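
The product form may be verified directly. The Python sketch below multiplies the original variance of corn yield by the successive factors (1 - r²), using the three coefficients that enter the formula for the corn yield problem, and then recovers the coefficient of multiple correlation from the same product. The variable names are our own.

```python
import math

variance_x1 = 44.6878  # sigma_1 squared

# r12, r13.2, and r14.23, as used in the text for the corn yield problem.
successive_coefficients = [-0.4984, -0.6255, -0.4057]

# Share of the original variance left after each variable is brought in.
unexplained_share = 1.0
for r in successive_coefficients:
    unexplained_share *= (1 - r**2)

variance_order_three = variance_x1 * unexplained_share
r_multiple_squared = 1 - unexplained_share

print(f"sigma squared 1.234 = {variance_order_three:.4f}")
print(f"sigma 1.234         = {math.sqrt(variance_order_three):.2f}")
print(f"R squared 1.234     = {r_multiple_squared:.4f}")
print(f"R 1.234             = {math.sqrt(r_multiple_squared):.3f}")
```

The result differs from the printed 17.0797 and from $S^2_{1.234} = 17.0746$ only in the second decimal place, a discrepancy attributable to the rounding of the coefficients to four places.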

BETA COEFFICIENTS 

The several coefficients of regression in an equation of 
multiple regression are, in effect, weights applied to the 
different independent variables in estimating the successive
values of the dependent variable. Usually these coefficients
of regression are not comparable, because the independent
factors are expressed in different units, or because they
differ in variability. It is often desirable to reduce the
coefficients of regression to comparable terms. This may
be done by expressing dependent and independent variables
alike in units of their respective standard deviations. The
coefficients of regression are then called beta coefficients,
and are represented by the symbols $\beta_{12.34}$, etc.

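In terms of a simple two-variable problem, we have

$$x_1 = b_{13} x_3.$$

If we change to standard deviation units we must divide
$x_1$ by $\sigma_1$ and $x_3$ by $\sigma_3$. This gives

$$\frac{x_1}{\sigma_1} = b_{13}\left(\frac{\sigma_3}{\sigma_1}\right)\left(\frac{x_3}{\sigma_3}\right).$$

The desired beta coefficient is, then,

$$\beta_{13} = b_{13}\,\frac{\sigma_3}{\sigma_1}.$$

For the corn yield example, we have

$$\beta_{13} = -1.866\left(\frac{\sigma_3}{\sigma_1}\right) = -.696.$$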

This may be taken to mean that with an increase of one 
standard deviation in X 3 (July temperature), the yield of 
corn decreased .696 of one standard deviation. 
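
The conversion to standard deviation units is a single multiplication, sketched below. The coefficient -1.866 is the one quoted above; the two standard deviations are hypothetical, chosen only so that their ratio is about .373, the ratio implied by the printed result. The function and variable names are likewise illustrative.

```python
def beta_coefficient(b, sigma_independent, sigma_dependent):
    """Express a coefficient of regression in standard deviation units."""
    return b * sigma_independent / sigma_dependent


b_13 = -1.866    # coefficient of regression quoted in the text
sigma_3 = 3.73   # hypothetical standard deviation of X3 (July temperature)
sigma_1 = 10.0   # hypothetical standard deviation of X1 (corn yield)

beta_13 = beta_coefficient(b_13, sigma_3, sigma_1)
print(round(beta_13, 3))
```

Only the ratio of the two standard deviations matters; any pair with the same ratio gives the same beta coefficient.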

These measurements are particularly useful in analyses
involving more than two variables. Here the relationships
between the beta coefficients and the coefficients of net
regression are similar to those indicated for the two-variable
problem. Thus

$$\beta_{12.34} = b_{12.34}\,\frac{\sigma_2}{\sigma_1}, \qquad \beta_{13.24} = b_{13.24}\,\frac{\sigma_3}{\sigma_1}, \qquad \beta_{14.23} = b_{14.23}\,\frac{\sigma_4}{\sigma_1}.$$

Substituting the required values in these equations, we have

$$\beta_{12.34} = -.209$$
$$\beta_{13.24} = -.529$$
$$\beta_{14.23} = -.292.$$

The second of these coefficients may be taken to mean that
with an increase of one standard deviation in July temperature,
when June and August temperatures are held constant,
corn yield decreases .529 of one standard deviation. The
other coefficients have similar meanings.

The beta coefficients relate to factors expressed in comparable
units and similar in respect of variability. A
fluctuation of one standard deviation in $X_2$ may be taken
to be equal to a fluctuation of one standard deviation in $X_3$.
The coefficients defining the changes in $X_1$ that are likely
to accompany these equal movements in $X_2$ and $X_3$ have
obvious significance.

CERTAIN LIMITATIONS 

The measures we have described in dealing with problems
of multiple and partial correlation are valid on the assumption
that the relationships among the different variables
are in all cases linear. Thus with four variables six different
pairs may be obtained. The regression in each of these
six cases should be linear if combined or net effects are to
be studied by the methods outlined above. If the regression
is non-linear when natural numbers are dealt with, it may
be possible to secure linear relationships by correlating
logarithms or reciprocals. Thus we might derive an estimating
equation of the type

$$\log X_1 = c + b_{12.34} X_2 + b_{13.24} X_3 + b_{14.23} X_4$$

if the relation between $X_1$ in logarithmic form and each
of the other variables in the original arithmetic form were
linear. The corresponding measures, $S$ and $R$, would then
be subject to the interpretation given in the following
chapter.

One other important limitation should be noted. Coefficients
of multiple or of net correlation based upon a large
number of variables have little significance unless the number
of observations be large. Misleadingly high values will
be secured when studies involving many variables are based
upon small samples. (Application of the corrections referred
to in the text will prevent misinterpretation in such cases.)
Within the limits set by these restrictions, the methods of
multiple and partial correlation constitute very powerful
instruments of economic analysis.

CHAPTER XVII


THE MEASUREMENT OF RELATIONSHIP AND 
THE PROBLEM OF ESTIMATION 

It is no great exaggeration to say that quantitative
method in economics and business centers about the problem
of estimation. Equations of regression, measures of
standard error and coefficients of correlation are of interest
largely because of their bearing upon the practical problems
of determining probable production, probable price, probable
business changes. It should not be understood from this
that the problem of estimation relates only to attempts to
forecast future changes. We make an estimate whenever
we seek to determine the most probable value from a
number of different observations, or whenever we employ
an equation which describes the relation between two or
more variables. The value of statistical technique rests
in large part upon its practical utility in the making of
estimates.

This object has been definitely to the fore in the preceding
chapters, which dealt with methods by which the value
of one variable might be estimated from a given value
of another. We may, at this point, briefly summarize
certain assumptions upon which the validity of this method
rests.

Some Assumptions Involved in the Making of 
Estimates 

In earlier chapters it has been pointed out that the most
probable value of a series of observations is their arithmetic
mean. Given a normal distribution about the mean, the
standard deviation affords an exact measure of the probabilities
involved in basing estimates upon the mean.
Similarly, the standard error of estimate affords an exact
measure of the probabilities involved in basing estimates
upon an equation of regression, again upon the assumption
that the distribution about the line of regression is normal.
The significance and usefulness of the equation of regression
may be determined by comparing the standard error of
estimate of a given variable with the standard deviation.

From the relation between these two values, moreover,
an abstract measure of relationship, the coefficient or index
of correlation, may be computed. This coefficient, or index,
is a thoroughly valid and accurate measure only if the
distribution about the line of regression and the distribution
about the mean are normal, or approximately so. Pronounced
departures from the normal type lessen the significance
of these measures.

In the foregoing discussion we have been concerned with
arithmetic values throughout. In speaking of estimates
based upon the mean we referred to the arithmetic mean.
The distributions about the mean and about the line of
regression are assumed to be normal when deviations are
measured arithmetically. The standard deviation and the
standard error of estimate are in arithmetic terms, referring
to absolute values. But may we assume that all the
distributions we deal with in economic analysis are of the
arithmetic type? Should estimates be made and errors
of estimate measured only in arithmetic terms? If they
should not be so limited, are the methods developed above
capable of adaptation to other distributions? These questions
may best be answered in terms of a specific problem.

A Problem of Estimation; Logarithmic and Ratio Values

In Table 132 the production and price of oats in the
United States from 1881 to 1913 are recorded. Appropriate
lines of trend were fitted to these series and the ratios
of the actual values of the items in each series to the trend
values determined.

Table 132

Production and Price of Oats in the United States

        Production   Straight line   Ratio of actual   Price of oats   Straight line   Ratio of actual
        of oats      trend of        production to     in Chicago      trend of        price to
Year    (millions    production      trend value       (cents          price           trend value
        of bu.)                                        per bu.)

1881       416          448              .929              47             36.0            1.30
1882       488          471             1.036              37             35.3            1.05
1883       571          494             1.156              31             34.6             .90
1884       583          517             1.128              29             34.0             .85
1885       629          540             1.165              28             33.2             .84
1886       624          563             1.108              25             32.5             .77
1887       659          586             1.124              30             31.2             .96
1888       701          609             1.151              24             30.5             .79
1889       751          632             1.188              24             29.8             .81
1890       523          655              .798              43             29.0            1.48
1891       738          678             1.088              31             28.3            1.10
1892       661          701              .943              30             27.5            1.09
1893       639          724              .882              31             26.8            1.16
1894       662          747              .886              28             26.1            1.07
1895       824          770             1.070              19             25.3             .75
1896       780          793              .983              18             23.6             .76
1897       791          816              .969              24             25.0             .96
1898       843          839             1.005              25             26.4             .95
1899       926          862             1.074              23             27.8             .83
1900       914          885             1.033              25             29.2             .86
1901       778          908              .857              42             30.6            1.37
1902     1,053          931             1.131              33             32.0            1.03
1903       869          954              .911              38             33.4            1.14
1904     1,009          977             1.033              30             34.8             .86
1905     1,090        1,000             1.090              31             36.2             .86
1906     1,036        1,023             1.013              39             37.6            1.04
1907       805        1,046              .770              51             39.0            1.31
1908       851        1,069              .796              52             40.4            1.29
1909     1,068        1,092              .978              43             41.8            1.03
1910     1,186        1,115             1.064              35             43.2             .81
1911       922        1,138              .810              51             44.6            1.14
1912     1,418        1,161             1.221              37             46.0             .80
1913     1,122        1,184              .948              41             47.4             .87

The lines of trend were fitted by H. B. Killough. The trend of
price is broken into two parts, 1881 to 1895 and 1896 to 1913,
a straight line being fitted to the data of each period.

It is desired to measure the relation between these two
variables. A hyperbolic curve of the general type $Y = aX^b$
appears to be an appropriate form to employ in describing
such a relationship. To fit this curve by the method of
least squares, the equation must be reduced to the logarithmic
form

$$\log Y = \log a + b \log X.$$

The normal equations required in fitting a curve of this
type are

$$\text{I} \qquad \Sigma(\log Y) = N \log a + b\,\Sigma(\log X)$$

$$\text{II} \qquad \Sigma(\log X \cdot \log Y) = \log a\,\Sigma(\log X) + b\,\Sigma(\log^2 X).$$

The values necessary for the solution of these equations
are determined from Table 133.¹

From this table we have

$$N = 33$$
$$\Sigma(\log Y) = -.32849 \qquad \Sigma(\log X \cdot \log Y) = -.1143005$$
$$\Sigma(\log X) = .037535 \qquad \Sigma(\log^2 X) = .096423.$$

Substituting in the normal equations, we secure

$$-.32849 = 33 \log a + .037535\,b$$
$$-.1143005 = .037535 \log a + .096423\,b.$$

Solving

$$\log a = -.00861$$
$$b = -1.18206.$$

The required equation is

$$\log Y = (9.99139 - 10) - 1.18206 \log X$$

or

$$Y = .9804\,X^{-1.18206}.$$

This is the equation which describes the average relationship
between the production and the price of oats
(when the actual figures for each are expressed as ratios
to the respective lines of trend). The corresponding curve
is plotted in Fig. 88 on page 592.

¹ I am indebted to Prof. H. B. Killough of Brown University for permission
to use the data presented in Tables 132 and 133. The figures are taken from his
comprehensive study of the factors affecting oat prices.
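
The solution of the two normal equations can be verified with a few lines of arithmetic. The sketch below takes the five sums exactly as printed in the text; the function name and argument names are illustrative assumptions, and the elimination is ordinary two-equation algebra rather than any particular routine prescribed by the text.

```python
def solve_normal_equations(n, sum_x, sum_y, sum_xx, sum_xy):
    """Solve  sum_y  = n*A     + b*sum_x
              sum_xy = A*sum_x + b*sum_xx   for A and b."""
    det = n * sum_xx - sum_x * sum_x
    b = (n * sum_xy - sum_x * sum_y) / det
    A = (sum_y - b * sum_x) / n
    return A, b


# Sums quoted in the text; "x" stands for log X and "y" for log Y.
log_a, b = solve_normal_equations(
    n=33,
    sum_x=0.037535,     # sum of log X
    sum_y=-0.32849,     # sum of log Y
    sum_xx=0.096423,    # sum of (log X)^2
    sum_xy=-0.1143005,  # sum of log X * log Y
)
a = 10 ** log_a  # back from common logarithms to natural numbers

print(round(log_a, 5), round(b, 5), round(a, 4))
```

Because the fitted line is straight in the logarithms, the same function serves for any curve that is reduced to linear form before fitting.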


Table 133

Computation of Values Required in Fitting a Curve to Data of Oat
Production and Prices

Example I

  (2)      (3)        (4)            (5)           (6)          (7)           (8)
 Ratio    Ratio
of price  of pro-
to trend  duction
          to trend
   Y        X        log Y          log X        log² Y       log² X     log Y · log X

 1.30      .929   .1139434       .9680157 - 1   .01298310   .001022995    -.0036444
 1.05     1.036   .0211893       .0153598       .00044899   .000235923     .0003255
  .90     1.156   .9542425 - 1   .0629578       .00209375   .003963685    -.0028808
  .85     1.128   .9294189 - 1   .0523091       .00498169   .002736242    -.0036920
  .84     1.165   .9242793 - 1   .0663259       .00573362   .004399125    -.0050222
  .77     1.108   .8864907 - 1   .0445398       .01288436   .001983794    -.0050557
  .96     1.124   .9822712 - 1   .0507663       .00031431   .002577217    -.0009000
  .79     1.151   .8976271 - 1   .0610753       .01048021   .003730192    -.0062524
  .81     1.188   .9084850 - 1   .0748164       .00837500   .005597494    -.0068468
 1.48      .798   .1702617       .9020029 - 1   .02898905   .009603432    -.0166862
 1.10     1.088   .0413927       .0366289       .00171336   .001341676     .0015162
 1.09      .943   .0374265       .9745117 - 1   .00140074   .000649653    -.0009539
 1.16      .882   .0644580       .9454686 - 1   .00415483   .002973674    -.0035160
 1.07      .886   .0293838       .9474337 - 1   .00086341   .002763216    -.0015446
  .75     1.070   .8750613 - 1   .0293838       .01560968   .000863408    -.0036712
  .76      .983   .8808136 - 1   .9925535 - 1   .01420540   .000055450     .0008875
  .96      .969   .9822712 - 1   .9863238 - 1   .00031431   .000187038     .0002425
  .95     1.005   .9777236 - 1   .0021661       .00049624   .000004692    -.0000483
  .83     1.074   .9190781 - 1   .0310043       .00654835   .000961267    -.0025089
  .86     1.033   .9344985 - 1   .0141003       .00429045   .000198818    -.0009236
 1.37      .857   .1367206       .9329808 - 1   .01725316   .004491573    -.0091629
 1.03     1.131   .0128372       .0534626       .00016479   .002858250     .0006863
 1.14      .911   .0569049       .9595184 - 1   .00323817   .001638760    -.0023036
  .86     1.033   .9344985 - 1   .0141003       .00429045   .000198818    -.0009236
  .86     1.090   .9344985 - 1   .0374265       .00429045   .001400743    -.0024515
 1.04     1.013   .0170333       .0056094       .00029013   .000031465     .0000955
 1.31      .770   .1172713       .8864907 - 1   .01375256   .012884361    -.0133113
 1.29      .796   .1105897       .9009131 - 1   .01223008   .009818214    -.0109580
 1.03      .978   .0128372       .9903389 - 1   .00016479   .000093337    -.0001240
  .81     1.064   .9084850 - 1   .0269416       .00837500   .000725850    -.0024656
 1.14      .810   .0569049       .9084850 - 1   .00323817   .008374812    -.0052076
  .80     1.221   .9030900 - 1   .0867157       .00939155   .007519613    -.0084036
  .87      .948   .9395193 - 1   .9768083 - 1   .00365792   .000537855     .0014027

32.83    33.338  17.6715068 - 18  14.0375350 - 14  .21721807  .096422642   -.1194567
                                                                           +.0051562
                                                                           -.1143005

STANDARD ERROR OF ESTIMATE IN LOGARITHMIC TERMS

How significant is this equation? With what degree of
accuracy may estimates be based upon it? To answer
these questions we must compute the standard error, $S$.
Since the fitting process was carried through in terms of
logarithms, the standard error may be computed in the
same terms. Following the procedure explained in earlier
sections with reference to the straight line and the potential
series, we may derive the following equation relating to
the logarithmic curve just fitted:

$$S^2_{\log y} = \frac{\Sigma(\log^2 Y) - \log a\,\Sigma(\log Y) - b\,\Sigma(\log X \cdot \log Y)}{N}.$$

Substituting the proper values, we have

$$S^2_{\log y} = \frac{.21721807 - (-.00861 \times -.32849) - (-1.18206 \times -.1143005)}{33} = \frac{.07927928}{33}$$

$$S^2_{\log y} = .0024024$$
$$S_{\log y} = .04901.$$

The standard error of estimate, in the form of a logarithm,
is .04901. As long as we deal with logarithms,
this is to be interpreted precisely as is the standard error
with respect to other curves. Assuming a normal distribution
of logarithms about the curve which describes the average
relationship, the chances are 68 out of 100 that the logarithm
of a given estimate will not differ from the logarithm of the
actual value by more than .04901, 95 out of 100 that the
logarithm of the given estimate will not differ from the
logarithm of the actual value by more than .09802, and
99.7 out of 100 that the logarithm of the given estimate
will not differ from the logarithm of the actual value by
more than .14703.
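
The substitution above is easily reproduced. This is a small sketch using the sums and constants as printed; the variable names are assumptions made for the illustration. The last digits of the residual sum differ slightly from the printed figure because the constants are carried here at their rounded printed values.

```python
import math

N = 33
SUM_LOG2_Y = 0.21721807      # sum of (log Y)^2, Table 133, column 6
SUM_LOG_Y = -0.32849         # sum of log Y
SUM_LOGX_LOGY = -0.1143005   # sum of log X * log Y
LOG_A = -0.00861             # constants of the fitted curve
B = -1.18206

# Sum of squared residuals about the fitted line, in logarithms.
residual_ss = SUM_LOG2_Y - LOG_A * SUM_LOG_Y - B * SUM_LOGX_LOGY
s_log_y = math.sqrt(residual_ss / N)

print(round(residual_ss, 6), round(s_log_y, 5))
# Limits for the 68, 95 and 99.7 per cent zones, still in logarithms.
print([round(k * s_log_y, 5) for k in (1, 2, 3)])
```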

INTERPRETATION OF THE STANDARD ERROR OF ESTIMATE;
ZONES OF ESTIMATE

What does this mean in terms of actual values? It means,
simply, that we are dealing throughout in terms of ratios
instead of absolute figures. The difference between the
logarithms of two numbers is the logarithm of the ratio
of one of the original numbers to the other. Thus the
absolute value of $S$ in a given case will depend upon the
magnitude of the values with which we are dealing. If
the user desires to reduce $S$ to absolute values, it must be
done always with reference to a given estimate. That is,
a given value of $X$ is substituted in the equation of average
relationship and the corresponding value of $Y$ estimated.
If the logarithmic equation is used, this estimate will be
in the form of a logarithm. To the logarithm of the estimate
add the value of $S_{\log y}$. The anti-logarithm of the number
thus secured will give the upper limit of a zone extending
a distance equal to $S$ above the line of regression. From
the logarithm of the estimate subtract the value of $S_{\log y}$.
The anti-logarithm of the number thus secured will give
the lower limit of a zone extending a distance equal to
$S$ below the line of regression. The odds are 68 out of 100
that the value of $Y$ in the given case will fall within the
limits thus marked out. The absolute limits corresponding
to $2S$ and $3S$ may be similarly determined.

The zone thus marked out with respect to a logarithmic
curve will differ materially from the similar zones already
described in dealing with simple linear equations. In the
simple case a zone extending $1S$ on each side of the estimating
curve has the same absolute width throughout its length,
and is centered always at the line of regression. The logarithmic
zone, when measured in natural numbers, is of
varying width, and, moreover, is not of the same width
on each side of the plotted curve. It is true, however, that
the ratios on the two sides of the curve are always equal.
That is, the ratio of a value $1S$ less than the computed
value to the computed value is the same as the ratio of
the latter to a value $1S$ greater. And when the curves
are plotted on paper ruled logarithmically, the zone included
within a distance $1S$ on each side of the plotted curve
takes the symmetrical form found in the earlier and simpler
cases. A person accustomed to thinking in terms of ratios
and to the use of logarithmic paper can readily interpret
this measure.

THE STANDARD ERROR OF ESTIMATE IN TERMS OF RATIOS

Since the ratios are equal throughout, the standard error
of estimate may be expressed in ratio terms. In the present
example we have

$$S_r = \text{anti-log } S_{\log y} = \text{anti-log } .04901 = 1.12$$

where $S_r$ is used to represent the standard error of estimate
in terms of ratios. $S_{\log y}$, as derived above, is positive,
hence the ratio exceeds unity. It is the ratio of the larger
number to the smaller. What does it mean? It means
that in 68 cases out of 100 the actual value, if it exceed
the estimate, will not exceed it by more than 12 per cent,
and, if it fall below the estimate, will stay within a
limit such that the estimate will not be more than 12 per
cent greater than the actual value. This is not a convenient
form, since this ratio always expresses the larger value
in terms of the smaller value. It would be more convenient
to have it always in terms of a percentage of the estimate.
This may be done by putting $S_{\log y}$ in negative terms,
and getting the corresponding natural value. The value
$-.04901$ is equal to $9.95099 - 10$, which is the logarithm of .8933.
In this form the ratio is based upon the relation of the
smaller to the larger number. To make $S_r$ readily intelligible
we may combine the two, writing

$$S_r = .89 \text{ to } 1.12.$$

Interpreting this, it means that, given a normal distribution,
in 68 cases out of 100 the actual value will not be less than
89 per cent of the estimate, or more than 112 per cent of
the estimate. This has a simple, definite meaning more
significant for most practical purposes than a similar
measure in terms of absolute values.¹

¹ The significance of a measure of reliability in percentage form was pointed
out by D. H. Davenport in 1922, in an unpublished article, and such a measure
has been employed in several studies. There has not been available, however,
a ready method of computing this measure, and its possibilities have not,
therefore, been fully realized.

To find the values of $2S_r$ or $3S_r$ these percentage figures
may not be simply multiplied by 2 or 3. The value of
$S_{\log y}$ must be so multiplied, and the resulting values reduced
to natural numbers. For convenience in use, the anti-logarithms
of both the positive and negative values should
be secured, as in the preceding case. The computations
are simple.

$$2S_{\log y} = .09802.$$

The anti-logarithm of this value, when considered positive,
is 1.25; when negative, .80.

$$3S_{\log y} = .14703.$$

The corresponding anti-logarithms are 1.40 and .71. Summarizing
for the standard error, we have

$$S_r = .89 \text{ to } 1.12$$
$$2S_r = .80 \text{ to } 1.25$$
$$3S_r = .71 \text{ to } 1.40.$$

The values given for $S_r$ indicate the probable percentage
limits within which actual value and estimated value should
fall in 68 out of 100 cases. The values given for $2S_r$ indicate
the probable percentage limits in 95 out of 100 cases.
The values of $3S_r$ indicate the probable percentage limits
in 99.7 cases out of 100, always on the assumption of a
normal distribution of the logarithms of the actual values
about the fitted curve.

APPLICATION OF THE STANDARD ERROR OF ESTIMATE

We may illustrate the use of $S_{\log y}$. Given a production
of oats 50 per cent above the trend value (i.e., the ratio to
trend is 1.50), what is the most probable accompanying
price ratio and what is the degree of accuracy of this
estimate?

The estimating equation is

$$\log Y = (9.99139 - 10) - 1.18206 \log X.$$

Substituting in this equation the value .176091 (the logarithm
of 1.50) we secure for $\log Y$ the value $9.78324 - 10$.
The corresponding natural number is .607. This means
that if production is 150 per cent of normal (as measured
by the given line of trend) price will probably be 60.7
per cent of normal (as measured by the line of trend).

To determine the reliability of this estimate, the standard
error must be secured. Employing the values of $S_r$ already
computed we find that 54 is 89 per cent of 60.7, while
68 is 112 per cent of 60.7. We interpret these figures to
mean that in 68 cases out of 100 the actual price prevailing
under the given production conditions will not be less than
54 per cent of the normal or trend value nor more than
68 per cent of normal.¹ Corresponding values for $2S_r$
and $3S_r$ may be determined in the manner outlined above.
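
The estimate and its zones can be worked through in a few lines. The following sketch uses the constants of the fitted curve and the value of the standard error as printed above; the function names are hypothetical and serve only this illustration.

```python
import math

LOG_A = -0.00861   # log a of the fitted curve
B = -1.18206       # exponent b of the fitted curve
S_LOG_Y = 0.04901  # standard error of estimate, in logarithms


def estimate_price_ratio(production_ratio):
    """Most probable price ratio for a given production ratio."""
    return 10 ** (LOG_A + B * math.log10(production_ratio))


def ratio_limits(k=1):
    """Lower and upper ratio factors for a zone of k standard errors."""
    return 10 ** (-k * S_LOG_Y), 10 ** (k * S_LOG_Y)


estimate = estimate_price_ratio(1.50)
low, high = ratio_limits(1)

print(round(estimate, 3))                                  # estimated price ratio
print(round(estimate * low, 2), round(estimate * high, 2))  # limits of the 1S zone
for k in (1, 2, 3):
    lo_k, hi_k = ratio_limits(k)
    print(k, round(lo_k, 2), round(hi_k, 2))
```

Note that the limits for 2S and 3S are obtained by multiplying the logarithm, not the ratio, which is why the zone is wider above the estimate than below it when measured in natural numbers.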

THE INDEX OF CORRELATION BASED ON LOGARITHMIC VALUES

We have still to compute the third measure, the abstract
index of correlation.² For an equation of the type

$$\log Y = \log a + b \log X$$

the formula for $\rho$ reduces to

$$\rho^2_{\log y \cdot \log x} = \frac{\log a\,\Sigma(\log Y) + b\,\Sigma(\log X \cdot \log Y) - N c^2_{\log y}}{\Sigma(\log^2 Y) - N c^2_{\log y}}$$

where $c_{\log y}$ represents the difference between the arithmetic
mean of the logarithms of the $Y$-values and the origin
(in this case, zero on the logarithmic scale). Substituting
the proper values, we have

$$\rho^2_{\log y \cdot \log x} = \frac{(-.00861 \times -.32849) + (-1.18206 \times -.1143005) - (33 \times .00009909)}{.21721807 - (33 \times .00009909)}$$

$$= \frac{.13466882}{.2139481} = .629445$$

$$\rho_{\log y \cdot \log x} = .793.$$

The index of correlation has a value of .793. How is
this to be interpreted when we are dealing with logarithms
as in the present case?

¹ A question arises at once as to the adequacy of the given lines of trend,
in the present problem. This question is discussed in greater detail in another
section.

² The symbol $\rho$ is used for this measure of correlation, instead of $r$, even
though the relationship in logarithmic form is linear. This is done because such
a measure, in terms of logarithms, cannot be interpreted in precisely the same
way as the ordinary coefficient of correlation.

Its significance may be clearer if viewed in terms of the
relationship

$$\rho^2_{\log y \cdot \log x} = 1 - \frac{S^2_{\log y}}{\sigma^2_{\log y}}.$$

In the present case these values are

$$S_{\log y} = .04901$$
$$\sigma_{\log y} = .08052.$$

When these values are squared and inserted in the above
formula, we have

$$\rho^2_{\log y \cdot \log x} = 1 - \frac{.002402}{.006483}$$

and

$$\rho_{\log y \cdot \log x} = .793.$$
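
The relation between the two measures of variability and the index can be checked as follows. In this sketch the standard deviation of the logarithms is computed from the column totals of Table 133 and the standard error is taken as printed; the variable names are illustrative assumptions.

```python
import math

N = 33
SUM_LOG_Y = -0.32849      # sum of log Y
SUM_LOG2_Y = 0.21721807   # sum of (log Y)^2
S_LOG_Y = 0.04901         # standard error of estimate, in logarithms

# Standard deviation of the logarithms about their arithmetic mean.
mean_log_y = SUM_LOG_Y / N
sigma_log_y = math.sqrt(SUM_LOG2_Y / N - mean_log_y ** 2)

# Index of correlation from the two measures of logarithmic variability.
rho = math.sqrt(1.0 - (S_LOG_Y / sigma_log_y) ** 2)

print(round(sigma_log_y, 5), round(rho, 3))
```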

What does this value measure? We have seen that $r$
and the more general index $\rho$ are abstract measures of the
degree of relationship between two variables, as this relationship
is described by given functions. The value of $\rho$
in a given case depends upon the variability about the
fitted line, in relation to the variability about the mean
of the $Y$'s. If the variability of estimates is materially
reduced when the equation of regression is used as a basis
for estimates, instead of the mean $Y$, the equation may be
assumed to describe a significant relationship. The value
of $\rho$ depends thus upon the relation between the two
quantities, $S_y$ and $\sigma_y$.

In the cases dealt with in the preceding chapter the
variability in each case was measured in terms of absolute
deviations, and the value of $\rho$ depended upon the relation
between the two given measures of absolute variability.
The sole difference in the present case is that we are working
in terms of logarithmic or ratio variability, deviations being
measured in terms of logarithms instead of natural numbers.

The index $\rho$ must be interpreted in the light of this fact.
Its value, as always, depends upon the relation between
two measures of variability, $S_y$ and $\sigma_y$, but in the present
instance these are expressed in terms of logarithms. In
brief, the value of $\rho$ depends upon the relation between
the ratio variability about the fitted curve and the ratio
variability about the geometric mean of the $Y$'s. (It is
the geometric mean of the $Y$'s, because that is the value
corresponding to the arithmetic mean of the logarithms.)

We have here a set of measures, therefore, which perform
in the field of ratios precisely the same service as is performed
in the field of natural numbers by $S$ and $\rho$ (in the
linear case, $r$). These measures are secured in the same
way as are $S$ and $\rho$, except that the equation of relationship
from which they are derived is one in which the dependent
variable is $\log Y$ (or, in the reverse case, $\log X$). The general
formulas for computing these values are the same as in
dealing with natural numbers, except that $\log Y$ replaces
$Y$ throughout. The operation is analogous to that of using
logarithmic paper instead of natural scale paper.

It should be noted that the values are in logarithmic or
ratio form if $Y$ is expressed logarithmically, whether $X$
be so expressed or not. Thus we have fitted a curve of
the type

$$\log Y = \log a + b \log X$$

the logarithmic form of the ordinary parabola or hyperbola.
The values $S$ and $\rho$ would also be in logarithmic form if
the curve were of the type

$$\log Y = \log a + X \log b$$

the logarithmic form of the exponential

$$Y = a(b^X).$$

In each of these cases the logarithmic equation is linear,
but this is not essential to the use of these measures. $S$ and
$\rho$ are generally applicable measures, whether ratios or natural
numbers be dealt with, and whether the functions be
linear or otherwise.

It may be well at this point to summarize the symbols
that have been used and to distinguish the different measures.
We may employ the symbols $S_y$, $\sigma_y$, and $\rho$ when
arithmetic relations are in question, the two former being
measures of variation in absolute terms, and the index $\rho$
referring to degree of relationship when natural numbers
are employed. If the logarithms of the $Y$'s are used it is
advisable to distinguish the symbols by subscripts, using
$S_{\log y}$ and $\sigma_{\log y}$ as measures of the logarithmic variation
about the fitted curve and about the arithmetic mean
of the logarithms of the $Y$'s, respectively. If $S_{\log y}$ is reduced
to ratio form, it may be written $S_r$. Since the index $\rho$
must be interpreted somewhat differently in this case, it
may be written $\rho_{\log y \cdot \log x}$, or $\rho_{\log yx}$.

The Use of Reciprocals in the Measurement of
Relationship

Another type of curve may be used to describe the
relationship between the production and price of oats, and
its use introduces us to a third field of correlation, a field
in which somewhat new concepts enter, and in which the
various measures must be interpreted in still another way.

HARMONIC EQUATION

This is a curve of the type

$$Y = \frac{1}{a + bX}$$

which may be expanded by adding additional terms to
the denominator, as

$$Y = \frac{1}{a + bX + cX^2}.$$

This hyperbolic form has been used in several studies as
an approximation to a “demand” curve for various commodities.

The equation to a curve of this type may be written

$$\frac{1}{Y} = a + bX$$

which is the equation to a straight line describing the relationship
between the reciprocals of the $Y$'s and the original
$X$ values. The normal equations required in fitting a
curve of this type are

$$\text{I} \qquad \Sigma\left(\frac{1}{Y}\right) = Na + b\,\Sigma(X)$$

$$\text{II} \qquad \Sigma\left(\frac{X}{Y}\right) = a\,\Sigma(X) + b\,\Sigma(X^2).$$

The method of computing the necessary values is illustrated
in Table 134.

Substituting the proper values in the normal equations,
we have

$$34.3360320 = 33a + 33.338b$$
$$35.2571485 = 33.338a + 34.168554b.$$

Solving,

$$a = -.1357$$
$$b = 1.1643.$$

Table 134

Computation of Values Required in Fitting a Curve to Data of Oat
Production and Prices

Example II

 (1)     (2)      (3)        (4)          (5)          (6)           (7)
        Price   Production
Year    Ratio    Ratio
          Y        X         1/Y          X/Y          1/Y²          X²

1881    1.30      .929     .7692308     .7146154     .59171602     .863041
1882    1.05     1.036     .9523810     .9866667     .90702957    1.073296
1883     .90     1.156    1.1111111    1.2844444    1.23456788    1.336336
1884     .85     1.128    1.1764706    1.3270588    1.38408307    1.272384
1885     .84     1.165    1.1904762    1.3869048    1.41723358    1.357225
1886     .77     1.108    1.2987013    1.4389610    1.68662507    1.227664
1887     .96     1.124    1.0416667    1.1708334    1.08506951    1.263376
1888     .79     1.151    1.2658228    1.4569620    1.60230736    1.324801
1889     .81     1.188    1.2345679    1.4666667    1.52415790    1.411344
1890    1.48      .798     .6756757     .5391892     .45653765     .636804
1891    1.10     1.088     .9090909     .9890909                  1.183744
1892    1.09      .943     .9174312     .8651376     .84168001     .889249
1893    1.16      .882     .8620690     .7603449     .74316296     .777924
1894    1.07      .886     .9345794     .8280373     .87343865     .784996
1895     .75     1.070    1.3333333    1.4266666    1.77777769    1.144900
1896     .76      .983    1.3157895    1.2934211    1.73130201     .966289
1897     .96      .969    1.0416667    1.0093750    1.08506961     .938961
1898     .95     1.005    1.0526316    1.0578948    1.10803329    1.010025
1899     .83     1.074    1.2048193    1.2939759    1.45158955    1.153476
1900     .86     1.033    1.1627907    1.2011628    1.35208221    1.067089
1901    1.37      .857     .7299270     .6255480     .53279343     .734449
1902    1.03     1.131     .9708738    1.0980583     .94259594    1.279161
1903    1.14      .911     .8771930     .7991228     .76946756     .829921
1904     .86     1.033    1.1627907    1.2011628    1.35208221    1.067089
1905     .86     1.090    1.1627907    1.2674419    1.35208221    1.188100
1906    1.04     1.013     .9615385     .9740385     .92455629    1.026169
1907    1.31      .770     .7633588     .5877863     .58271666     .592900
1908    1.29      .796     .7751938     .6170543     .60092543     .633616
1909    1.03      .978     .9708738     .9495146     .94259594     .956484
1910     .81     1.064    1.2345679    1.3136802    1.52415790    1.132096
1911    1.14      .810     .8771930     .7106263     .76946756     .656100
1912     .80     1.221    1.2500000    1.5262500    1.56250000    1.490841
1913     .87      .948    1.1494253    1.0896562    1.32117852     .898704

Total  32.83    33.338   34.3360320   35.2571485   36.85702940   34.168554

The desired equation is, therefore, 

- .1367 + 1.1643Z. 


i 

Y 
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THE STANDARD ERROR AND THE INDEX OF CORRELATION IN 
TERMS OF RECIPROCALS 

To determine the utility of this equation we must have 
the standard error and the index of correlation. The 1 wo 
necessary formulas may be derived as in the preceding 

cases. Representing by y the reciprocal of an actual value 

we have, for each residual, 

d=-a + bX-i- 


(1) 


Multiplying by d and summing 

= a2(d)+6S(rfa-)-S 

Since 




we have 


S(d) = OandS(rfJ) = 0, 




(2) 


Multiplying the residual equation (1) now hyi-’ and sum- 
ming, we have 

s(^).as(i)+6s(f)-2(iy- 

Substituting the equivalent of S in the preceding equa- 
tion (2), we secure 

5;(#) = s(i)’ - «j(i) - ) 


and for *Sr, we have 


y 


s(r)' - as(i) - e(^.) 

■Mr 
Inserting this value of $S^2_{1/y}$ in the general formula for the 
index of correlation 

\[ \rho^2 = 1 - \frac{S^2}{\sigma^2} \]

and simplifying, we have 

\[ \rho^2_{1/y} = \frac{a\Sigma\left(\frac{1}{Y}\right) + b\Sigma\left(\frac{X}{Y}\right) - Nc^2_{1/y}}{\Sigma\left(\frac{1}{Y}\right)^2 - Nc^2_{1/y}} \]

(here $c_{1/y}$ denotes the arithmetic mean of the reciprocals, 
$\Sigma(1/Y)/N$). Inserting the proper values in these two 
equations, we find that 

\[ S_{1/y} = .1191 \]

\[ \rho_{1/y} = .766. \]

For the standard deviation of the original Y-values in 
terms of reciprocals, we secure 

\[ \sigma_{1/y} = .1851. \]

(The subscript $1/y$ is used in connection with each of these 
measures, as they should be distinguished from measures 
based upon natural numbers or logarithms.) 
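The relation between the three reported measures can be checked numerically. The following is only a minimal sketch: the function name is illustrative and not from the text, and it uses nothing beyond the general formula for the index of correlation together with the two published figures.

```python
import math

def index_of_correlation(std_error, std_dev):
    """Index of correlation from the general formula rho^2 = 1 - S^2 / sigma^2.

    Both arguments must be expressed in the same units
    (here, reciprocals of the price ratio).
    """
    return math.sqrt(1.0 - (std_error / std_dev) ** 2)

# Figures reported in the text, all in terms of reciprocals.
S_RECIP = 0.1191      # standard error of estimate
SIGMA_RECIP = 0.1851  # standard deviation of the reciprocals of Y

rho = index_of_correlation(S_RECIP, SIGMA_RECIP)
print(f"{rho:.4f}")
```

The result lies within rounding distance of the .766 given above; the small discrepancy is to be expected, since the inputs are themselves rounded to four places.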

INTERPRETATION OF THE STANDARD ERROR OF ESTIMATE 

How may we interpret these results? As in all former 
problems of this type the equation gives us a means of 
estimating Y from a known value of X. The standard 
error $S_{1/y}$ serves as a measure of the reliability of such 
estimates, and $\rho_{1/y}$ is an abstract measure of the degree of 
relationship between the two variables. But in the present 
case all these measures are in terms of reciprocals. The 
equation enables us to estimate the reciprocal of Y, the 
standard error has significance only in the form of a 
reciprocal, and the value of $\rho$ depends upon the relation 
between two measures ($S^2_{1/y}$ and $\sigma^2_{1/y}$) both of which are 
in terms of reciprocals. 

An illustration may make these meanings clear. If, in 
a given year, the production of oats is 150 per cent of 
trend, what is the most probable price? Substituting in 
the equation 

\[ \frac{1}{Y} = -.1357 + 1.1643X \]

a value of 1.50 for X, we have 

\[ \frac{1}{Y} = 1.6108, \qquad Y = .621. \]

We may expect a price approximately 62 per cent of trend. 
As a measure of the reliability of this estimate, we have 

\[ S_{1/y} = .1191. \]

This must be applied to the estimate in terms of reciprocals. 
Thus we have 

1.6108 + .1191 = 1.7299 
1.6108 - .1191 = 1.4917. 

Reducing these reciprocals to natural numbers we secure 
.578 and .670 as the desired values. The most probable 
price, then, is 62.1 per cent of trend, and, on the assumption 
of an approximately normal distribution of reciprocals 
about the curve, the odds are 68 out of 100 that the price 
will fall between 57.8 per cent of trend and 67.0 per cent 
of trend. The limits of 2S and 3S may be similarly determined 
by adding to and subtracting from the estimate, 
as a reciprocal, amounts equal to twice .1191 and three 
times .1191. The results secured may then be converted 
to natural numbers. Just as with logarithms, the value 
in absolute terms of a given difference between reciprocals 
varies at different points within the range of Y-values. 
Accordingly, the limits of reliability determined from $S_{1/y}$ 
should be expressed in natural numbers only after a particular 
estimate has been made. 
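The procedure just described (estimate the reciprocal, add and subtract multiples of the standard error in reciprocal units, and only then convert to natural numbers) can be sketched as follows. The constants are those of the fitted equation and standard error given in the text; the function and variable names are illustrative assumptions.

```python
# Constants of the fitted equation 1/Y = A + B*X and its standard
# error of estimate, as reported in the text.
A, B = -0.1357, 1.1643
S_RECIP = 0.1191  # in reciprocal units

def reciprocal_estimate(x, k=1):
    """Return (estimate, lower limit, upper limit) of Y in natural numbers.

    The limits are formed in reciprocal units and converted afterwards,
    so they are not equidistant from the estimate.
    """
    r = A + B * x                      # estimated reciprocal of Y
    lower = 1.0 / (r + k * S_RECIP)    # a larger reciprocal is a smaller Y
    upper = 1.0 / (r - k * S_RECIP)    # valid only while r > k * S_RECIP
    return 1.0 / r, lower, upper

estimate, lower, upper = reciprocal_estimate(1.50)
print(round(estimate, 3), round(lower, 3), round(upper, 3))

# Limits of 2S and 3S, converted only after the particular estimate is made.
for k in (2, 3):
    _, lo, hi = reciprocal_estimate(1.50, k)
    print(k, round(lo, 3), round(hi, 3))
```

For a production ratio of 1.50 the first line reproduces, to three places, the estimate of .621 and the limits of .578 and .670 given above. Note that the upper limit lies farther from the estimate than the lower one.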

A Comparison of Measures of Relationship 

In interpreting $\rho$ similar considerations enter. The value 
of the index of correlation, as we have seen, depends upon 
the degree of variation about the curve, as compared with 
the variation about the average of the original dependent 
series. In handling natural numbers, variability about the 
fitted line is compared with the variability about the 
arithmetic mean of the dependent variable, both measured 
in absolute terms (i.e., $S_y$ is compared with $\sigma_y$). In handling 
logarithms, variability about the fitted line is compared 
with variability about the arithmetic mean of the logarithms 
of the dependent series, variability being measured 
in each case in terms of logarithms. But logarithmic 
deviations, as we have seen, may be interpreted in terms 
of ratios. The logarithmic deviations from the line represent 
the ratios of actual values to computed, while logarithmic 
deviations about the arithmetic mean of the logarithms of 
the original series represent the ratios of the actual values 
of the dependent series to their geometric mean. The value 
of $\rho_{\log y}$ depends upon the relation between these respective 
deviations (i.e., $S_{\log y}$ is compared with $\sigma_{\log y}$). 

In fitting a curve in which the reciprocals of the dependent 
variable are employed, variability about the fitted line is 
measured in terms of reciprocals, and the variability of 
the original series is measured in the same terms. That 
is, $\sigma_{1/y}$ is computed from the differences between the 
reciprocals of the actual values and the arithmetic mean of all 
these reciprocals. But the arithmetic mean of these reciprocals 
is the reciprocal of the harmonic mean. Thus, in 
short, the value of the index of correlation, $\rho_{1/y}$, depends upon 
the relation between variability about the fitted line and 
variability about the harmonic mean of the dependent series, 
variation in both cases being measured in terms of reciprocals 
(i.e., $S_{1/y}$ is compared with $\sigma_{1/y}$). 

We have, therefore, three broad families of curves for 
describing the relationship between variable quantities. 
These are: 

1. Curves in the fitting of which natural values of the dependent 
variable are employed. Equations to all curves of this family 
will be of the type 

\[ Y = f(X). \]

2. Curves in the fitting of which logarithms of the dependent 
variable are employed. In all such cases the equations will be 
of the type 

\[ \log Y = f(X). \]

3. Curves in the fitting of which reciprocals of the dependent 
variable are employed. For these curves the equations will 
be of the type 

\[ \frac{1}{Y} = f(X). \]

In any one of these three cases the equations may be 
linear or non-linear. In so far as this problem of interpretation 
is concerned, there is no limitation as to the function 
of X which may be employed. (The computation of S 
and $\rho$ by the methods suggested above involves certain 
limitations, which are outlined elsewhere.) 

The standard error of estimate for the first family of 
curves is derived in terms of the original units of measurement 
(for the dependent variable) and has a direct and 
simple meaning in these terms. The index of correlation, 
for curves of this type, is a measure of the degree to which 
the absolute variability of the dependent variable may be 
lessened by measuring deviations from the fitted curve 
instead of from the arithmetic mean. 

The standard error of estimate for the second family of 
curves is derived, by the method outlined, in terms of 
logarithms. It is more convenient in general to give it meaning 
in terms of ratios. The index of correlation, $\rho_{\log y}$, 
is a measure of the degree to which the logarithmic or 
ratio variability of the dependent variable may be lessened 
by computing deviations (or ratios) with the fitted curve 
instead of the geometric mean as base. 

The standard error of estimate for the third family of 
curves is derived by the same process as in the other cases, 
but emerges as a reciprocal. The index of correlation, $\rho_{1/y}$, 
is a measure of the degree to which the variability of the 
dependent variable, in terms of reciprocals, may be lessened 
by computing reciprocal deviations from the fitted curve 
instead of from the harmonic mean. 
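The three families can be compared on a common footing by applying one routine to the natural values, the logarithms, and the reciprocals of the dependent variable. The sketch below is illustrative only: the seven observations are hypothetical (they are not the oats data), f(X) is taken as a straight line in every case, and the function names are assumptions rather than anything in the text.

```python
import math

def fit_line(xs, zs):
    """Least-squares straight line z = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    b = (sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
         / sum((x - mean_x) ** 2 for x in xs))
    return mean_z - b * mean_x, b

def s_and_rho(xs, ys, transform):
    """Fit transform(Y) = a + b*X; return (S, rho) in the transformed units."""
    zs = [transform(y) for y in ys]
    a, b = fit_line(xs, zs)
    n = len(zs)
    mean_z = sum(zs) / n
    s2 = sum((z - (a + b * x)) ** 2 for x, z in zip(xs, zs)) / n   # about the line
    var = sum((z - mean_z) ** 2 for z in zs) / n                   # about the mean
    return math.sqrt(s2), math.sqrt(1.0 - s2 / var)

# Hypothetical production ratios (X) and price ratios (Y).
xs = [0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3]
ys = [1.45, 1.22, 1.10, 0.97, 0.90, 0.80, 0.74]

families = {
    "natural": lambda y: y,           # deviations about the arithmetic mean
    "logarithmic": math.log10,        # deviations about the geometric mean
    "reciprocal": lambda y: 1.0 / y,  # deviations about the harmonic mean
}
results = {name: s_and_rho(xs, ys, t) for name, t in families.items()}
for name, (s, rho) in results.items():
    print(f"{name:12s} S = {s:.4f}  rho = {rho:.3f}")
```

The three values of S printed here are in different units (natural numbers, logarithms, reciprocals) and so cannot be compared with one another directly; only the indexes of correlation are abstract measures.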

FACTORS GOVERNING THE CHOICE OF MEASURES OF 
RELATIONSHIP 

It is clear, therefore, that the choice of a type of curve 
to describe a given relationship must be governed by basic 
considerations as to the type of average which is most 
appropriate as a measure of the central tendency of the 
given series. And this brings in a related question as to 
whether the dispersion about this average more nearly 
approximates the normal type when measured in absolute 
terms, in logarithms, or in reciprocals. In selecting a 
curve and in using the measures S and $\rho$ there is always 
present an implicit assumption with respect to these points. 

When absolute values are important, and the dispersion 
of the dependent variable approaches the normal type when 
plotted on an arithmetic scale, measures of relationship of 
the arithmetic type would appear to be appropriate. But, 
as we have seen, in handling series in which rates of change 
rather than absolute amounts of change are of primary 
importance and the dispersion appears to follow a geometric 
law, the arithmetic mean and other arithmetic measures 
are notoriously inadequate. In such cases logarithmic curves 
seem preferable to arithmetic, and measures of the reliability 
of estimates and of degree of relationship which are based 
upon ratios seem to be more suitable than those based upon 
absolute values. 

The harmonic mean has not been so widely employed as 
either of the above averages, and some attention may be 
given to principles governing its use in problems of the 
type here considered. In general, such harmonic measures 
are marked by the same weaknesses as the arithmetic, 
except that they err in the opposite direction. Geometric 
measures are perhaps better adapted to all-around employ- 
ment than either. Yet in one particular field of interest 
to the economist the harmonic mean is particularly appro- 
priate, and the utilization of reciprocals, as in the preceding 
example, seems to be justified. 

The use of the harmonic mean assumes a normal distribution 
of reciprocals which, in natural numbers, means a 
much wider scatter above the average than below. The 
use of a curve of the type 

\[ \frac{1}{Y} = a + bX \]

involves a similar assumption as to the relation between 
Y and X. A given absolute increase in X will be accompanied 
by a certain decrease in the value of Y. The same 
absolute decrease in X will be accompanied by an increase 
in the value of Y which is larger than the decrease registered 
in the preceding case. But this is the relation which prevails, 
for many commodities, between the amounts produced and 
the price, the latter considered dependent. A given increase 
in production will cause some lowering of price. An equal 
decrease will cause a much greater increase in price. 
Moreover, when averaging the prices of such commodities 
over a period, the harmonic mean may give a more typical 
value than any other average.¹ In such cases there is a 
strong a priori justification for using a curve of the reciprocal 
type and measuring the accuracy of all estimates in terms 
of harmonic relations. 
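The asymmetry described above can be seen by evaluating the fitted oats equation at equal steps of production on either side of normal. This is a small sketch using the constants reported in the text; the step of 0.2 and the function name are arbitrary choices made for illustration.

```python
def price_ratio(x, a=-0.1357, b=1.1643):
    """Estimated price ratio Y from the reciprocal equation 1/Y = a + b*X."""
    return 1.0 / (a + b * x)

base = price_ratio(1.0)          # production at normal
fall = base - price_ratio(1.2)   # price decline for a rise of 0.2 in X
rise = price_ratio(0.8) - base   # price advance for a fall of 0.2 in X

print(round(base, 3), round(fall, 3), round(rise, 3))
```

With production at normal the estimated price ratio is about .972. An increase of 0.2 in the production ratio lowers the estimate by roughly .18, while an equal decrease raises it by roughly .28.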


ARITHMETIC, GEOMETRIC, AND HARMONIC MEASURES 

The contrast between these different methods may be 
brought home most effectively by comparing the results 
obtained when curves of these three types are fitted to the 
same data. The computations involved in fitting curves 
of the second and third types (logarithmic and reciprocal) 
have been illustrated with reference to the data of oat 
production and prices (Table 132). A straight line (arithmetic) 
is fitted to the same data, and the necessary accompanying 
measures computed. The three sets of results are 
brought together in Table 135. 

¹ “Buyers and sellers of potatoes are frequently mistaken as to the price 
justified by fundamental economic conditions. If such an error is general in the 
fall, it may happen, for example, that the price which results is too high. If the 
price is too high in the early part of the season, potatoes will not be consumed 
fast enough to dispose of the supply available. Farmers and dealers will then 
find that not all of the stocks on hand can be sold at existing prices. Since 
potatoes can not be carried over from one year to the next, the price, under 
such conditions as have been mentioned, must be lowered enough to permit 
the supply to be disposed of before the end of the season. A properly adjusted 
price would remain the same throughout the season, except for a gradual advance 
to cover cost of storage, and would maintain a fairly uniform consumption 
throughout the season. But since an abnormally high price early in the 
season causes small consumption, it must be compensated by an abnormally 
low price during the remainder of the season, or not all the crop can be sold. 

“Similarly, if the price is abnormally low early in the season, the supply will 
be exhausted too rapidly and those who still have potatoes will find that they 
can get abnormally high prices for them during the remainder of the season.” 

But how, given the abnormally high or abnormally low prices during part of 
a season, may we compute the average price which would be justified by the 
true conditions of demand and supply, if these had been correctly estimated? 
Since “a low price during part of a season will be compensated only by a 
disproportionately high price during the remainder of the season” the arithmetic 
average for an entire season “will be somewhat higher than the average which 
would have resulted had a proper price been established at the beginning of the 
season. This difficulty is eliminated by taking the harmonic mean of the monthly 
prices.” 

Holbrook Working, Factors Determining the Price of Potatoes in St. Paul and 
Minneapolis. Technical Bulletin 10, University of Minnesota Agricultural 
Experiment Station, 8-10. 

Table 135 

Relation between the Production and Price of Oats, 1881-1913 

Comparison of Results of Curve Fitting 
(Prices are the dependent variable in each case) 

     Equation                                     Standard error of estimate     Index of correlation 

A    $Y = 2.24 - 1.236X$                          $S_y = .12$                    $\rho_y = .783$ 

B    $\frac{1}{Y} = -.1357 + 1.1643X$             $S_{1/y} = .1191$              $\rho_{1/y} = .766$ 

C    $\log Y = -.00861 - 1.18206 \log X$          $S_{\log y} = .04901$          $\rho_{\log y} = .793$ 

It is impossible to compare the three standard errors as 
they stand, since only the first one is in the original units 
of measurement (ratio of actual price to trend). In the 
following table are given estimates, based on each of these 
equations, as to the most probable price (in terms of ratio 
to trend) which would accompany each of five different 
conditions of production.¹ Each estimate is accompanied 
by a series of values which indicate the limits set by the 
standard error. Throughout, the values of the estimates 
plus and minus S, 2S, and 3S are given, in order to indicate 
the probable scatter of actual values about the estimates. 
The different amounts of variation which may be expected 
about each of the three lines of relationship are measured 
by the actual differences between the estimates and the 
limiting cases. These differences are given in the columns 
headed Δ. All values in this table are comparable, being 
reduced to the original units (ratio of actual price to trend). 

¹ For the purpose of this illustration the limits of actual observation have 
been exceeded in setting up Table 136. Such extrapolation involves the 
possibility of errors of another sort. With these we are not here concerned. 
Table 136 

Comparison of Price Estimates and of Standard Errors of Estimate 
Based on Three Equations Relating to the Production and 
Price of Oats 

Column headings: 
(1)  Value of X (ratio of production to normal) 
(2)  Estimated value of Y (ratio of price to trend) from arithmetic equation (A) 
(3)  Limits of arithmetic estimate 
(4)  Δ 
(5)  Estimated value of Y from reciprocal equation (B) 
(6)  Limits of reciprocal estimate 
(7)  Δ 
(8)  Estimated value of Y from logarithmic equation (C) 
(9)  Limits of logarithmic estimate 
(10) Δ 

 (1)    (2)        (3)        (4)     (5)         (6)         (7)      (8)        (9)        (10) 

  .5   1.622   +3S = 1.982   +.36    2.240   +3S = 11.223   +8.983    2.224   +3S = 3.114   +.890 
               +2S = 1.862   +.24            +2S =  4.803   +2.563            +2S = 2.780   +.556 
                +S = 1.742   +.12             +S =  3.055    +.815             +S = 2.491   +.267 
                -S = 1.502   -.12             -S =  1.768    -.472             -S = 1.979   -.245 
               -2S = 1.382   -.24            -2S =  1.461    -.779            -2S = 1.779   -.445 
               -3S = 1.262   -.36            -3S =  1.244    -.996            -3S = 1.579   -.645 

  .8   1.251   +3S = 1.611   +.36    1.257   +3S =  2.281   +1.024    1.276   +3S = 1.786   +.510 
               +2S = 1.491   +.24            +2S =  1.794    +.537            +2S = 1.596   +.319 
                +S = 1.371   +.12             +S =  1.478    +.221             +S = 1.429   +.153 
                -S = 1.131   -.12             -S =  1.093    -.164             -S = 1.136   -.140 
               -2S = 1.011   -.24            -2S =   .967    -.290            -2S = 1.021   -.255 
               -3S =  .891   -.36            -3S =   .867    -.390            -3S =  .906   -.370 

 1.0   1.004   +3S = 1.364   +.36     .972   +3S =  1.490    +.518     .980   +3S = 1.372   +.392 
               +2S = 1.244   +.24            +2S =  1.265    +.293            +2S = 1.225   +.245 
                +S = 1.124   +.12             +S =  1.100    +.128             +S = 1.098   +.118 
                -S =  .884   -.12             -S =   .871    -.101             -S =  .872   -.108 
               -2S =  .764   -.24            -2S =   .789    -.183            -2S =  .784   -.196 
               -3S =  .644   -.36            -3S =   .722    -.250            -3S =  .696   -.284 

 1.2    .757   +3S = 1.117   +.36     .793   +3S =  1.106    +.313     .790   +3S = 1.106   +.316 
               +2S =  .997   +.24            +2S =   .977    +.184            +2S =  .987   +.197 
                +S =  .877   +.12             +S =   .875    +.082             +S =  .885   +.095 
                -S =  .637   -.12             -S =   .724    -.069             -S =  .703   -.087 
               -2S =  .517   -.24            -2S =   .667    -.126            -2S =  .632   -.158 
               -3S =  .397   -.36            -3S =   .618    -.175            -3S =  .561   -.229 

 1.5    .386   +3S =  .746   +.36     .621   +3S =   .798    +.177     .607   +3S =  .852   +.245 
               +2S =  .626   +.24            +2S =   .728    +.107            +2S =  .761   +.154 
                +S =  .506   +.12             +S =   .670    +.049             +S =  .680   +.073 
                -S =  .266   -.12             -S =   .578    -.043             -S =  .542   -.065 
               -2S =  .146   -.24            -2S =   .541    -.080            -2S =  .484   -.123 
               -3S =  .026   -.36            -3S =   .508    -.113            -3S =  .433   -.174 


Zones of Estimate and Their Significance 

A careful study of this table should make clear the nature 
of estimates based on the three types of equations here 
presented. The fundamental differences lie not so much 
in the actual values of the estimates, as in the standard 
errors which measure the reliability of these estimates and 
indicate the limits within which the actual values are likely 
to fall. In other words, the differences lie in the assumptions 
made as to the character of the scatter about the curves. 

Fig. 87. The Relation between the Production and Price of Oats: 
Illustrating the Use of an Arithmetic Equation of Regression and 
Arithmetic Zones of Estimate 

The measure $S_y$, which relates to the arithmetic curve, 
gives the same absolute range to errors of estimate whether 
the estimated value be high or low. An arithmetic dispersion 
about the curve is assumed. In each case the estimate 
is the arithmetic mean of the value which exceeds the 
estimate by an amount equal to $S_y$ (or any multiple of $S_y$) 
and the value which falls below it by an equal amount. 
These conditions are brought out graphically in Fig. 87. 
The original points are plotted, the straight line of relationship 
(arithmetic) is shown, and zones of estimate having 
widths, respectively, of $2S_y$, $4S_y$, and $6S_y$, centering at the 
fitted line, are marked out. 

Fig. 88. The Relation between the Production and Price of Oats: 
Illustrating the Use of a Logarithmic Equation of Regression and 
Geometric Zones of Estimate (horizontal axis: Production Ratio) 

Fig. 89. The Relation between the Production and Price of Oats: 
Illustrating the Use of a Logarithmic Equation of Regression and 
Geometric Zones of Estimate (Plotted on Double Logarithmic Paper) 

The measure $S_{\log y}$ gives the same relative or percentage 
range to errors of estimate, whether the estimate be high 
or low. This means that the absolute range within which 
the actual values should fall is much less when the estimates 
are low than when they are high. It assumes a geometric 
dispersion about the curve which describes the relationship. 
The estimate is, in this case, the geometric mean of the 
value which exceeds it by an amount equal to $S_{\log y}$ (or 
any multiple of $S_{\log y}$) and the value which falls below it 
by an equal amount. Fig. 88 presents these relationships 
graphically. The original data are here plotted, together 
with the graph of the equation 

\[ Y = .9804X^{-1.18206}. \]

There are shown, also, the limits of zones of estimate having 
widths equal, respectively, to $2S_r$, $4S_r$, and $6S_r$, centering 
(geometrically) at the line of relationship. A comparison 
of Fig. 87 and Fig. 88 will reveal the differences between 
estimates based on the assumption of an arithmetic distribution 
and those based on the assumption of a geometric 
distribution. 

The points and lines shown in Fig. 88 are plotted on a 
logarithmic scale in Fig. 89. On this scale the curve of 
relationship becomes straight, and the zones of estimate 
appear as symmetrical and of equal width throughout the 
range. This transformation when the data are plotted on 
logarithmic paper makes clear the fundamental simplicity 
of the assumptions involved in making estimates from 
logarithmic values. 

In using the measure $S_{1/y}$ we carry still further the assumption 
that the variability about the curve is greater with 
high prices than with low. It shows a very limited range 
to errors of estimate when the estimate is low and a very 
wide range when the estimated price is high. A harmonic 
dispersion about the curve is assumed. The computed 
value, or estimate, is always the harmonic mean of the 
value which exceeds it by an amount equal to $S_{1/y}$ (or any 
multiple of $S_{1/y}$) and the value which falls below it by an 
equal amount. 
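The three statements about the estimate being the arithmetic, geometric, or harmonic mean of its paired limits can be verified directly from the three fitted equations. The following sketch uses the constants and standard errors reported in Table 135; the function names are illustrative assumptions, and the limits are computed from the equations rather than copied from Table 136, so they may differ from the tabulated figures in the third decimal place.

```python
import math

# Standard errors of estimate reported for the three equations.
S_Y, S_LOG, S_RECIP = 0.12, 0.04901, 0.1191

def arithmetic_zone(x, k=1):
    """Estimate and limits from Y = 2.24 - 1.236X, limits at +/- k*S_y."""
    est = 2.24 - 1.236 * x
    return est, est - k * S_Y, est + k * S_Y

def geometric_zone(x, k=1):
    """Estimate and limits from log Y = -.00861 - 1.18206 log X."""
    log_est = -0.00861 - 1.18206 * math.log10(x)
    return 10 ** log_est, 10 ** (log_est - k * S_LOG), 10 ** (log_est + k * S_LOG)

def harmonic_zone(x, k=1):
    """Estimate and limits from 1/Y = -.1357 + 1.1643X (needs r > k*S)."""
    r = -0.1357 + 1.1643 * x
    return 1.0 / r, 1.0 / (r + k * S_RECIP), 1.0 / (r - k * S_RECIP)

for x in (0.5, 1.0, 1.5):
    est, lo, hi = arithmetic_zone(x)
    print("arithmetic", x, round(est, 3), round((lo + hi) / 2, 3))
    est, lo, hi = geometric_zone(x)
    print("geometric ", x, round(est, 3), round(math.sqrt(lo * hi), 3))
    est, lo, hi = harmonic_zone(x)
    print("harmonic  ", x, round(est, 3), round(2 / (1 / lo + 1 / hi), 3))
```

In each printed line the last two figures agree: the estimate coincides with the arithmetic, geometric, or harmonic mean of its limits, according to the family of the equation.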
Fig. 90. The Relation between the Production and Price of Oats: 
Illustrating the Use of an Equation of Regression Based upon Reciprocals, 
and of Harmonic Zones of Estimate (horizontal axis: Production Ratio) 

In Fig. 90 the curve $\frac{1}{Y} = -.1357 + 1.1643X$ is plotted, 
together with the original observations. Zones of estimate 
with widths of $2S_{1/y}$, $4S_{1/y}$, and $6S_{1/y}$, centering (harmonically) 
at the fitted line, are shown. The differences between this 
figure and each of the two preceding are quite marked, 
particularly with respect to the zones of estimate. On the 
assumption of a normal harmonic distribution about the 
curve describing the relationship, the outer zone (with width 
equal to 6S) marks the limits within which 99.7 per cent 
of all the points should fall, and the inner zone (with width 
equal to 2S) marks the limits within which 68 per cent 
of all the points should fall. By plotting reciprocals throughout, 
instead of natural numbers, this apparently abnormal 
distribution could be reduced to the symmetrical form 
secured in plotting the geometric values on the logarithmic 
chart. 

For both high and low estimates the geometric measure, 
$S_{\log y}$, stands between the arithmetic measure, $S_y$, and the 
harmonic measure, $S_{1/y}$. While the two latter have their 
particular functions, and are appropriate in certain cases, 
it is probably true that in using such methods as these in 
economic analysis, measures of the geometric family are 
more generally useful than those of the other types. This 
means, merely, that ratios are usually more important 
than absolute differences. It seems reasonable therefore 
to base estimates upon an equation of the type 

\[ \log Y = f(X) \]

and to measure the reliability of these estimates in terms 
of logarithms or ratios, using $S_{\log y}$ or $S_r$. In such cases, as 
we have seen, correlation is measured by $\rho_{\log y \cdot \log x}$ or $\rho_{\log y \cdot x}$. 
The value of this index depends upon the ratio variability 
about the curve, as compared with the ratio variability 
about the geometric mean.¹ 

¹ The reasoning in C. M. Walsh's book, The Problem of Estimation (London, 
King, 1921, p. 12) is peculiarly applicable to the present problem. Citing Galileo, 
in defence of the use of the geometric mean in averaging estimates, Walsh 
writes: “And so errors must be measured by an error which is a ratio between 
the estimate and the true quantity, and not a concrete quantity itself. We cannot 
measure errors by so many pounds, feet or crowns; we must measure them 
by the proportions of the pounds, feet or crowns in the erroneous estimates to 
the pounds, feet or crowns in the thing estimated.” (Italics mine.) This ar- 
CHAPTER XVIII 


STATISTICAL INDUCTION AND THE PROBLEM 
OF SAMPLING, CONCLUDED 

The methods of induction discussed in an earlier section 
(Chapter XIV) dealt with the more familiar procedures 
employed in generalizing results secured from the study 
of samples. Certain research problems call for modifications 
of the methods there described, while for some purposes 
quite different instruments are needed. In the present 
chapter, therefore, we carry forward the discussion of 
statistical inference, considering methods appropriate to 
certain special conditions and special problems. 

Generalizing from Small Samples 

The standard error of an arithmetic mean, we have seen, 
is given by 

\[ \sigma_M = \frac{\sigma}{\sqrt{N}} \]

where N is the number of observations in the sample and 
$\sigma$ is the standard deviation of the population from which 
the sample is drawn. We do not know the standard deviation 
of the population but we approximate it from the standard 
deviation of the sample. (For convenience in this exposition 
we shall use s as a symbol for the standard deviation of 
the sample; $\sigma$ will denote the standard deviation of the 
population.) This is an acceptable approximation when 
N is reasonably large, say 30 or more. But for small values 
of N the standard deviation of the sample is subject to 
a definite bias, tending to make it consistently lower than 
the standard deviation of the population. The value of 
$\sigma_M$ derived by the customary method is also biased downward. 
Therefore, when methods appropriate to large 
samples are employed with small samples, we consistently 
under-estimate the sampling errors to which our measurements 
are subject. This bias shows remarkable consistency, 
however. With samples of any stated size the magnitude 
of the error to be expected from the use of the standard 
deviation of the sample as an approximation to the standard 
deviation of the population may be determined, and correction 
made for it. Accordingly, generalization of results secured 
from small samples is possible. In the nature of things the 
margin of error in such generalization is larger than it is when 
large samples are used, but the distortion due to sheer 
bias may be avoided.¹ 

The nature of the error involved in generalizing from 
small samples may be brought out in the following terms. 
If we represent by M the mean of the population from 
which a sample is drawn, by $\bar{X}$ the mean of a single sample, 
and by $\sigma_{\bar{x}}$ the standard deviation of a distribution of a 
number of $\bar{X}$'s computed from successive samples, we may 
write 

\[ T = \frac{\bar{X} - M}{\sigma_{\bar{x}}} \]

The quantity T is the deviation of the mean of the sample 
from the mean of the population, expressed in units of the 
standard deviation of the sample means. When $\sigma_{\bar{x}}$ is 
determined from the actual distribution of a number of 
$\bar{X}$'s, or from the true standard deviation of the population 
and N of the sample, the quantity T may be interpreted 
as a normal deviate. The significance of given values of 
T may then be determined with reference to a table of 
areas under the normal curve. Actually, we do not 
have a large number of $\bar{X}$'s, which may be arranged in a 
frequency distribution, nor do we know the value of $\sigma$ 
(the standard deviation of the population), nor of $\sigma_{\bar{x}}$ (the 
standard error of $\bar{X}$). We approximate $\sigma$ by s (the standard 
deviation of the sample) and $\sigma_{\bar{x}}$ by what we may call $s_{\bar{x}}$ 

\[ s_{\bar{x}} = \frac{s}{\sqrt{N-1}} \quad \text{if } s \text{ has been computed from } \sqrt{\frac{\Sigma x^2}{N}}; \]

\[ s_{\bar{x}} = \frac{s}{\sqrt{N}} \quad \text{if } s \text{ has been computed from } \sqrt{\frac{\Sigma x^2}{N-1}}. \]

When these approximations are based upon small samples, the T derived 
from them may not be interpreted as a normal deviate. For 
the distribution of T varies with the size of the sample. With 
small samples the distribution departs significantly from the 
normal type. Statistical inferences that fail to take account 
of this are inaccurate. 

¹ The bias involved in the use of s as an approximation to $\sigma$, for small 
samples, was first discovered by “Student.” For the original memoir see “The 
Probable Error of the Mean,” Biometrika, Vol. 6, 1908, 1-25. 

A discussion in detail of the distributions of statistical 
measurements obtained from small samples would carry 
us beyond the scope of the present book. We may briefly 
note, however, certain characteristics of the distribution 
function of the standard deviation. These are effectively 
revealed by the results of an interesting experiment con- 
ducted by W. A. Shewhart. 

Shewhart drew 1,000 samples, each consisting of four 
observations, from a normally distributed parent population 
with a known standard deviation, equal to unity.¹ The 
standard deviation, s, of each sample was computed. The 
distribution of these thousand values of s is represented by 
the dots in Fig. 91.² (The line running through the dots 
defines the theoretical distribution of s's to be expected, 
with samples of 4, on the basis of “Student's” theory. 
There is a notably close agreement between the theoretical 
and observed distributions.) Traditional sampling concepts 
would lead us to expect a normal distribution of s's, centering 
about 1, the value of $\sigma$ in the parent population. 
Instead, the distribution is definitely skew, with the 
measurements clustering about a central tendency well below 
unity. The mode of the thousand values of s here represented 
is, in fact, .717 and the arithmetic mean is .801. 
These s's, it will be recalled, represent estimates of $\sigma$. 
There is a clear tendency for such estimates, based on 
samples of four, to understate the true value. 

Fig. 91. Distribution of Standard Deviations in Samples of Four Drawn 
from a Normal Universe (horizontal axis: standard deviation) 

¹ W. A. Shewhart, Economic Control of Quality of Manufactured Product, 
New York, Van Nostrand, 1931, 163-173, 185-186. 

² The figure is here reproduced with the permission of Dr. Shewhart and his 
publishers. 
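An experiment of the kind Shewhart conducted can be imitated with pseudo-random numbers. The sketch below is not his procedure or his data: it assumes that s is computed with divisor N (an inference from the reported mean of .801, not something stated in the text), and the seed and function name are arbitrary.

```python
import math
import random

def simulate_sample_sds(n_samples=1000, sample_size=4, seed=0):
    """Standard deviations (divisor N) of samples from a normal
    population with mean 0 and standard deviation 1."""
    rng = random.Random(seed)
    sds = []
    for _ in range(n_samples):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        mean = sum(sample) / sample_size
        sds.append(math.sqrt(sum((v - mean) ** 2 for v in sample) / sample_size))
    return sds

sds = simulate_sample_sds()
mean_s = sum(sds) / len(sds)
share_below_one = sum(s < 1.0 for s in sds) / len(sds)

print(f"mean of s = {mean_s:.3f}")
print(f"share of samples with s below the true sigma = {share_below_one:.2f}")
```

Under these assumptions the mean of the simulated s's falls near .80, and well over half of the samples give an s below the true value of unity, which is the downward tendency described above.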

The symbol T has been used above to define the deviation 
of a statistical measure from some standard or hypothetical 
value, expressed in units of the estimated standard error 
of the measure in question, when the deviation, so expressed, 
could be interpreted as a normal deviate. In the present 
exposition we shall employ the symbol t to relate to 
approximations to T when these approximations are based on 
small samples. 

The difference between T and t may be reduced to more 
definite terms. If we let $x = \bar{X} - M$, we may write 

\[ T = \frac{x}{\sigma_{\bar{x}}}, \qquad t = \frac{x}{s_{\bar{x}}}. \]

We may derive t from T: 

\[ t = T \div \frac{s_{\bar{x}}}{\sigma_{\bar{x}}} \]

The normally distributed quantity, T, has been divided 
by the factor $s_{\bar{x}}/\sigma_{\bar{x}}$ to give the quantity t. Opportunity to 
correct for the bias is given us, however, by the fact that 
the distribution of $s_{\bar{x}}/\sigma_{\bar{x}}$ is known. Thus the probability 
corresponding to any stated value of t may be determined 
(when t defines a departure from a certain hypothetical 
value, measured in units of $s_{\bar{x}}$).¹ 
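The practical effect of treating t as though it were a normal deviate can be gauged by simulation. The sketch below rests on several assumptions that are not in the text: samples of four from a normal population with mean zero, s computed with divisor N so that the estimated standard error is s/√(N-1), and the value 3.182 taken as the .05 point for samples of this size.

```python
import math
import random

def simulate_t(n_samples=20000, sample_size=4, seed=1):
    """Values of t = (sample mean - M) / s_xbar for samples drawn from a
    normal population with M = 0 and sigma = 1."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_samples):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        mean = sum(sample) / sample_size
        s = math.sqrt(sum((v - mean) ** 2 for v in sample) / sample_size)
        s_xbar = s / math.sqrt(sample_size - 1)  # estimated standard error
        values.append(mean / s_xbar)
    return values

ts = simulate_t()
share_beyond_normal = sum(abs(t) > 1.96 for t in ts) / len(ts)
share_beyond_small = sum(abs(t) > 3.182 for t in ts) / len(ts)

print(f"|t| > 1.96  in {share_beyond_normal:.3f} of samples")
print(f"|t| > 3.182 in {share_beyond_small:.3f} of samples")
```

Were t a normal deviate, only about 5 per cent of the values would lie beyond 1.96. With samples of four the share is much larger, and the limit must be widened to roughly 3.18 before the share falls to about 5 per cent.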

It is of some interest to compare values of t corresponding 
to stated probabilities, for samples of varying sizes, with 
values of T corresponding to the same probabilities. This 
is done in Table 137. 

The familiar values given in the customary table of 
areas under the normal curve appear on the last line of 

^ The degree of error involved in using s as an approximation to $\sigma$, for small
samples, is indicated by the following figures, taken from W. A. Shewhart
(loc. cit., 185). They define the relation between the modal s for samples of
size N drawn from a population of which the standard deviation is known, and
the true $\sigma$ of that population.

    Size of sample N     Modal s as a decimal fraction of true $\sigma$

            3                    .577
            4                    .707
            5                    .775
            6                    .817
            7                    .845
            8                    .866
            9                    .882
           10                    .894
           15                    .931
           20                    .949
           25                    .959
           30                    .966
           50                    .980
          100                    .990

The fractions given above define relations that are to be expected on the
basis of error theory, as modified by "Student" to take account of conditions
affecting small samples. The modal value of the 1,000 standard deviations
obtained by Shewhart in his empirical test of this theory was, as we have seen,
.717 of the standard deviation of the universe. This result is very close indeed
to the expected value of .707, for samples in which N = 4.

Table 137

Values of t and T Corresponding to Stated Probabilities ^

                                Probability
   n      .80       .50       .40       .20       .10       .05       .01

   1     .325     1.000     1.376     3.078     6.314    12.706    63.657
   2     .289      .816     1.061     1.886     2.920     4.303     9.925
   3     .277      .765      .978     1.638     2.353     3.182     5.841
   4     .271      .741      .941     1.533     2.132     2.776     4.604
   5     .267      .727      .920     1.476     2.015     2.571     4.032
   6     .265      .718      .906     1.440     1.943     2.447     3.707
   7     .263      .711      .896     1.415     1.895     2.365     3.499
   8     .262      .706      .889     1.397     1.860     2.306     3.355
   9     .261      .703      .883     1.383     1.833     2.262     3.250
  10     .260      .700      .879     1.372     1.812     2.228     3.169
  20     .257      .687      .860     1.325     1.725     2.086     2.845
  30     .256      .683      .854     1.310     1.697     2.042     2.750
   ∞   .25335    .67449    .84162   1.28155   1.64485   1.95996   2.57582


Table 137, for n = ∞. These are the values of T, as a normal
deviate, corresponding to probabilities of .80, .50, etc.
Thus, when we are dealing with infinitely large samples,
the probability of a given sample's yielding a value of T
as great as .25335 or greater (either above or below the
mean) is .80. (The area between the maximum ordinate
and an ordinate erected at +.25335 is 10 per cent of the
total area under the normal curve. Twenty per cent of
the total area will fall within ± .25335, and 80 per cent
will fall beyond these limits.) Similarly, just 50 per cent
of the values of T will exceed the limits ± .67449; 5 per
cent will exceed the limits ± 1.95996; 1 per cent will
exceed the limits ± 2.57582.

^ The entries in this table are extracts from a more detailed table (Table IV)
in R. A. Fisher's Statistical Methods for Research Workers, Edinburgh, Oliver
and Boyd, sixth edition, 1936. The table is printed here through the courtesy
of Dr. Fisher and his publishers.

As n grows smaller each of these limits must be extended,
if the probabilities are to remain constant. For samples
in which n is equal to 10, 50 per cent of the values of t will
fall beyond the limits ± .700; 5 per cent will exceed the
limits ± 2.228, and 1 per cent will exceed the limits
± 3.169. (The letter n in Table 137 refers to the number
of degrees of freedom in the computation of t. This general
concept has been discussed in Chapter XV. When the
arithmetic mean of a sample is being tested for significance,
n = N - 1.) If in applying various statistical tests we attach
significance to a given level of probabilities, such as 5/100
or 1/100, we must recognize that the values of t corresponding
to these probabilities vary with n. Fortunately, we
now know how these values vary and, using such a table as
that given above, may make allowance for the variation.
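
Entries of the kind given in Table 137 can be recomputed from the
known form of the distribution of t. The sketch below is one way of
doing so, not the method by which the published table was prepared:
it integrates the density of t numerically (Simpson's rule) and
searches by bisection for the value exceeded with a stated
probability. The function names, step counts and search limits are
assumptions chosen for illustration, and the gamma-function call
restricts it to moderate values of n.

```python
import math


def t_density(x, n):
    """Density of Student's t with n degrees of freedom, evaluated at x."""
    c = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return c * (1 + x * x / n) ** (-(n + 1) / 2)


def two_tailed_p(t, n, steps=2000):
    """Probability that t falls beyond -t or +t, by Simpson's rule
    applied to the area between 0 and t (steps must be even)."""
    h = t / steps
    area = t_density(0.0, n) + t_density(t, n)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_density(i * h, n)
    area *= h / 3
    return 1.0 - 2.0 * area


def critical_t(p, n, upper=100.0):
    """Value of t exceeded in absolute size with probability p."""
    lower = 0.0
    for _ in range(50):  # bisection on the two-tailed probability
        mid = (lower + upper) / 2
        if two_tailed_p(mid, n) > p:
            lower = mid
        else:
            upper = mid
    return (lower + upper) / 2


for n in (5, 10, 30):
    row = ", ".join(f"P={p}: {critical_t(p, n):.3f}" for p in (0.50, 0.05, 0.01))
    print(f"n={n:>2}  {row}")
```

The printed values may be compared with the corresponding lines of
Table 137; the widening of the limits as n falls is the point at issue.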

For convenience in exposition we have distinguished T, 
as a normal deviate, from t, a similar deviate relating to a 
distribution of quantities derived from small samples, and 
therefore not normal. The probabilities corresponding to 
a given value of T are not the same as the probabilities 
corresponding to an identical value of t. Indeed, these 
probabilities vary for the same value of t computed from 
samples of different sizes. The distinction between T and 
t need not be preserved, however. We may use t generally 
to define the deviation of a statistical measure from some 
standard or hypothetical value, expressed in units of the 
standard error of the measure in question. The quantity 
t is to be interpreted as a normal deviate when large samples 
are dealt with. The interpretation is modified in dealing 
with small samples, as we have seen. The nature of the 
modification required is shown by the entries in Table 137 
and in Appendix Table II. 

EXAMPLES OF TESTS BASED ON t-TABLE

In determining whether the mean of a sample deviates
significantly from any stated value we may compute t
from the relation

$$t = \frac{\bar{X} - M}{s_{\bar{x}}}$$

where $\bar{X}$ is the mean of the sample, M is the stated value
and $s_{\bar{x}}$ is an approximation to the standard error of $\bar{X}$.
For this approximation we have

$$s_{\bar{x}} = \frac{s}{\sqrt{N}}$$

where s is the standard deviation of the sample (here computed
from $s = \sqrt{\Sigma d^2/(N - 1)}$). The value t, which for larger
samples we have interpreted with reference to a table of areas
under the normal curve, we here interpret with reference
to the special t-table for small samples. In using the t-table
for this purpose we take n of that table as equal to N - 1.

For the six New England states the average earnings
of factory workers in 1935,^ as indicated by census returns,
were as follows:

    Maine               $  851
    New Hampshire          892
    Vermont                940
    Massachusetts        1,007
    Rhode Island           938
    Connecticut          1,016

    Average             $  940.67

^ These averages, and similar ones cited below, are derived by dividing the
total wages paid by the average number of wage-earners employed during the
year. Part-time workers are included. The averages do not represent full-time
earnings, therefore.

For s we obtain the figure $63.99. The standard error of
the mean is

$$s_{\bar{x}} = \frac{s}{\sqrt{N}} = \frac{\$63.99}{\sqrt{6}} = \$26.13.$$

Does the average of annual earnings of factory workers
in the six New England states differ significantly from
$1,022, the average for the country as a whole? Computing
t we have

$$t = \frac{\bar{X} - M}{s_{\bar{x}}} = \frac{\$940.67 - \$1,022}{\$26.13} = -3.11.$$

Consulting the t-table with n = 5 we find that for P = .01,
t = 4.032. The observed deviation is not as great as this.
If our standard is a P of .01, the average for the New
England states is not to be judged significantly less than
the average for the country as a whole. If the standard
were a P of .05, however, the deviation would be considered
significant.
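
The arithmetic of this test may be retraced as follows. This is a
sketch of the computation described above, using the six state
figures from the text and the divisor N - 1 for s; the function name
is hypothetical, and the last digit of s may differ slightly from the
printed figure through rounding.

```python
import math

# Average earnings of factory workers, New England states, 1935
earnings = {
    "Maine": 851,
    "New Hampshire": 892,
    "Vermont": 940,
    "Massachusetts": 1007,
    "Rhode Island": 938,
    "Connecticut": 1016,
}


def one_sample_t(values, stated_value):
    """Return the mean, s, the standard error of the mean, and t."""
    n_obs = len(values)
    mean = sum(values) / n_obs
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n_obs - 1))
    se_mean = s / math.sqrt(n_obs)
    return mean, s, se_mean, (mean - stated_value) / se_mean


mean, s, se_mean, t = one_sample_t(list(earnings.values()), 1022)
print(f"mean = {mean:.2f}, s = {s:.2f}, standard error = {se_mean:.2f}")
print(f"t = {t:.2f} with n = {len(earnings) - 1} degrees of freedom")

# Limits from Table 137 for n = 5
print("significant at P = .05:", abs(t) > 2.571)
print("significant at P = .01:", abs(t) > 4.032)
```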

Similarly, we may test with reference to the t-table the
significance of a difference between two means, computed
from small samples. In this case we obtain t from the
relation

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s} \sqrt{\frac{N_1 N_2}{N_1 + N_2}}$$

where the X's and N's have the customary meanings, and
s is, in effect, an average standard deviation of the two
distributions. For s we have

$$s = \sqrt{\frac{\Sigma d_1^2 + \Sigma d_2^2}{N_1 + N_2 - 2}}.$$

Here $d_1$ and $d_2$ are used, respectively, to denote deviations
of given observations from the means of the two distributions.
The value t, as derived above, corresponds to $t = D/\sigma_D$,
where D is the difference between two means and $\sigma_D$ is
the standard error of that difference. For small samples,
however, the customary formula for $\sigma_D$ is modified somewhat,
and the special t-table rather than the table of normal
deviates is used. In consulting the t-table in a problem of
this type, n is taken as equal to $N_1 + N_2 - 2$.

Average earnings of workers employed in manufacturing
plants in six Southern states, in 1935, are shown below:

    North Carolina      $  662
    South Carolina         615
    Georgia                599
    Tennessee              744
    Alabama                640
    Mississippi            541

    Average             $  633.50

Does this average differ significantly from the mean earnings
in six New England states in the same year? For the
computation of s we have

$$s = \sqrt{\frac{20,491.3 + 23,153.5}{10}} = \$66.06$$

and for t

$$t = \frac{\$940.67 - \$633.5}{\$66.06} \sqrt{\frac{36}{12}} = 8.05.$$

In the t-table, for n = 10, we find that the value of t
corresponding to a P of .01 is 3.169. The present value is clearly
significant. The two samples could not have come from
one homogeneous parent population.
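
The comparison of the two groups of states may be reproduced in a
few lines. The sketch below follows the relations given in this
section (a pooled s with divisor $N_1 + N_2 - 2$); the function and
variable names are hypothetical.

```python
import math

new_england = [851, 892, 940, 1007, 938, 1016]
southern = [662, 615, 599, 744, 640, 541]


def sum_sq_dev(values):
    """Sum of squared deviations from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def pooled_t(a, b):
    """Return the pooled s, t, and the degrees of freedom."""
    n1, n2 = len(a), len(b)
    s = math.sqrt((sum_sq_dev(a) + sum_sq_dev(b)) / (n1 + n2 - 2))
    diff = sum(a) / n1 - sum(b) / n2
    t = diff / s * math.sqrt(n1 * n2 / (n1 + n2))
    return s, t, n1 + n2 - 2


s, t, df = pooled_t(new_england, southern)
print(f"sums of squares: {sum_sq_dev(new_england):.1f}, {sum_sq_dev(southern):.1f}")
print(f"s = {s:.2f}, t = {t:.2f}, n = {df}")
print("exceeds the .01 limit of 3.169:", t > 3.169)
```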

The t-table has particular value in connection with the
interpretation of coefficients of regression. We may have
observed that a given variable, Y, appears to increase by
a constant increment or at a constant rate as another
variable, X, changes in value. The degree of relationship
between the two variables may be measured in terms of r,
the coefficient of correlation, but special interest often
attaches to the functional relationship and, in particular,
to the apparent regression of Y on X. Does b of the equation
of regression

$$Y = a + bX$$

depart significantly from zero, or from some other value
which has significance for the purpose in mind? Here we
must judge b with reference to the sampling errors to which
it is exposed.

A general test of this type was applied in an earlier
section (Chapter XIV), in seeking to determine whether
average corn yield in Kansas had shown a significant decline
over the period 1890-1933. For smaller samples we may
compute t by exactly the methods there presented, but we
should interpret t with reference to the special t-table
adapted to small samples. As a general formula we have

$$t = \frac{b - \beta}{\sigma_b}$$

where b is a coefficient of regression and $\beta$ is a norm with
reference to which we wish to judge the given value of b.
For the standard error of b we have

$$\sigma_b = \frac{S_y}{\sqrt{\Sigma x^2}}$$

where

$$S_y = \sqrt{\frac{\Sigma (Y - Y_c)^2}{N - 2}}.$$

(In these expressions, $x = X - \bar{X}$, Y is an observed value
of the dependent variable, and $Y_c$ is the corresponding
computed value.) In interpreting the value of t thus secured,
the t-table is employed with n = N - 2.

This test may be extended to the comparison of two 
coefficients of regression. The series in Table 138 provide 
an illustration. 


Table 138

Aggregate Values of Loans on Securities and Commercial Loans,
Reporting Member Banks, Federal
Reserve System, 1922-1929
(In hundreds of millions of dollars)

    Year      Loans on        Commercial loans
              securities      ("other loans")

    1922         39                 73
    1923         41                 78
    1924         45                 80
    1925         53                 82
    1926         57                 86
    1927         62                 87
    1928         69                 89
    1929         77                 92

For loans on securities the trend (i.e., the equation of
regression of volume of loans on time) is defined by

$$Y_1 = 30.63 + 5.49 X_1.$$

The corresponding equation for commercial loans is

$$Y_2 = 72.13 + 2.54 X_2.$$

In each case the origin is at 1921. The eight-year period
was marked by an increase of loans on securities which was
much more rapid than the corresponding advance in commercial
loans. We must ask, however, whether the difference
between the two coefficients of regression is really significant,
if account be taken of sampling fluctuations.

The coefficients to be compared are

    $b_1 = 5.49$
    $b_2 = 2.54.$

In testing whether $b_1 - b_2$ is significant (i.e., deviates
significantly from zero) we must compute

$$t = \frac{b_1 - b_2}{\sigma_{b_1 - b_2}}$$

$\sigma_{b_1 - b_2}$ being, of course, the standard error of the difference
between the two coefficients of regression. For this standard
error we have

$$\sigma_{b_1 - b_2} = \sqrt{S_y^2 \left( \frac{1}{\Sigma x_1^2} + \frac{1}{\Sigma x_2^2} \right)}$$

where $x_1$ and $x_2$ are given values of the two variables,
expressed as deviations from their respective arithmetic
means, and

$$S_y^2 = \frac{\Sigma (Y_1 - Y_{c1})^2 + \Sigma (Y_2 - Y_{c2})^2}{N_1 + N_2 - 4}.$$

($S_y^2$ is a measure of the average scatter about the two lines
of regression.)

In the present example we have

$$S_y^2 = 2.40$$

$$\sigma_{b_1 - b_2} = \sqrt{2.40 \left( \frac{1}{42} + \frac{1}{42} \right)} = \sqrt{\frac{4.80}{42}} = .338$$

$$t = \frac{b_1 - b_2}{\sigma_{b_1 - b_2}} = \frac{5.49 - 2.54}{.338} = 8.73.$$

For the interpretation of this value of t we enter the t-table
with $n = N_1 + N_2 - 4 = 12$. In this case the value of t
far exceeds the value of 3.055, corresponding to P = .01.
The results are not consistent with the hypothesis that
the true value of $b_1 - b_2$ is zero. The trends of the two
series differ significantly. (Here, again, the reader should
bear in mind that such tests of significance apply only
with important qualifications to economic series that are
ordered in time.)
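
The comparison of the two trends may be retraced from the entries
of Table 138. The sketch below fits each line by least squares with
time measured from an origin at 1921, pools the scatter about the two
lines, and forms t. The function names are hypothetical, and because
the text works with rounded intermediate figures the results agree
with the printed ones only to about the second decimal.

```python
import math

years = list(range(1, 9))  # 1922-1929, measured from an origin at 1921
securities = [39, 41, 45, 53, 57, 62, 69, 77]
commercial = [73, 78, 80, 82, 86, 87, 89, 92]


def fit_line(xs, ys):
    """Least-squares slope b, the sum of squared deviations of x,
    and the sum of squared residuals about the fitted line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return b, sxx, resid


b1, sxx1, resid1 = fit_line(years, securities)
b2, sxx2, resid2 = fit_line(years, commercial)

df = len(securities) + len(commercial) - 4
s_y_sq = (resid1 + resid2) / df                     # pooled scatter
se_diff = math.sqrt(s_y_sq * (1 / sxx1 + 1 / sxx2))
t = (b1 - b2) / se_diff

print(f"b1 = {b1:.2f}, b2 = {b2:.2f}")
print(f"S_y^2 = {s_y_sq:.2f}, standard error of difference = {se_diff:.3f}")
print(f"t = {t:.2f} with n = {df}")
```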

Sampling Errors of Coefficients of Correlation
Computed from Small Samples

As a general formula for the determination of the standard
error of the coefficient of correlation we have made use of

$$\sigma_r = \frac{1 - r^2}{\sqrt{N - 1}}$$

In error theory, the r that appears in the numerator of
the right-hand member of this equation is the coefficient
of correlation in the universe from which the sample in
question is drawn. But this r is not known. Our best
approximation to it is the r derived from the sample. Here,
again, we face distortion in small samples, a distortion
that is the greater the higher the value of the true correlation.
The nature of this bias may be readily understood.
If we are drawing samples from a universe in which the
true value of r is +.95, the range of the possible variation
of the sample r's above the true r is only .05. But the
range of possible variation below the true value is 1.95
(i.e., from +.95 to -1.00). Accordingly, a distribution
of r's obtained from a great many small samples from this
universe will be sharply skew. An estimate of the true
value based upon a sample value will be subject to
corresponding bias. This bias will not be present when the
population value of r is zero. (The distribution of sample
r's when the population value of r is zero will be symmetrical,
but will depart somewhat from the normal type in other
respects.) It will not be pronounced when the samples
are large, even for high values of r. But when samples
are small and the population value of r departs materially
from zero, substantial inaccuracy results from the use of
the formula given above.

Allowance may be made for this bias by use of the table
showing the distribution of t, for samples of various sizes.
R. A. Fisher has shown that the procedure employed in
deriving t, in testing whether a coefficient of linear regression
differs significantly from zero, may be used, with an algebraic
modification of the mathematical expression, in determining
the significance of r. If we are testing the hypothesis that
a sample from which a given r has been computed was drawn
from a population in which the true value of r is zero, we
may compute t from the relation

$$t = \frac{r \sqrt{N - 2}}{\sqrt{1 - r^2}}$$

This is equivalent, of course, to dividing the quantity r - 0
(i.e., the deviation of the given r from the hypothetical
value of zero) by $\sqrt{1 - r^2}/\sqrt{N - 2}$. In consulting the
t-table for the interpretation of the values thus obtained, n,
the number of degrees of freedom, is taken as equal to N - 2.

As an illustration, we may test the results obtained from
a study of the relation between the production and the
price of cotton in the United States, covering 35 observations.
The value of r is -.65. We have

$$t = \frac{-.65 \sqrt{35 - 2}}{\sqrt{1 - (-.65)^2}} = -4.91.$$

In consulting the t-table we find that for n = 33 the value
of t corresponding to a probability of 1 per cent ^ is approximately
2.73. If the true value of t were zero, a value as
great as 2.73 or greater would occur only 1 time out of
100, as a result of chance fluctuations of sampling. The
present value of t is substantially greater than 2.73. It
is highly improbable that it reflects a chance drawing from
a population in which the true value of t (and, of course,
of r) is zero. The results we have obtained are not, then,
consistent with the hypothesis that the true value of r is
zero. There appears to be a significant negative correlation
between the production and the price of cotton.

^ This probability refers to the likelihood of deviations above or below the
assumed true value of zero. It corresponds to the sum of areas at both extremities
of a frequency curve. We may divide it by two to obtain the probability
of a deviation of the stated magnitude in one direction only from the hypothetical
value. In most problems of the type here discussed it is conservative practice
to test given results with reference to the probability of a deviation of
given magnitude, without consideration of the direction of deviation. The
tabulated values of t lend themselves to this procedure.

Table 139

Values of the Correlation Coefficient for Different Levels of
Significance ^

      n       P = .05       P = .02       P = .01

      1       .996917       .9995066      .9998766
      2       .95000        .98000        .990000
      3       .8783         .93433        .95873
      4       .8114         .8822         .91720
      5       .7545         .8329         .8745
      6       .7067         .7887         .8343
      7       .6664         .7498         .7977
      8       .6319         .7155         .7646
      9       .6021         .6851         .7348
     10       .5760         .6581         .7079
     11       .5529         .6339         .6835
     12       .5324         .6120         .6614
     13       .5139         .5923         .6411
     14       .4973         .5742         .6226
     15       .4821         .5577         .6055
     16       .4683         .5425         .5897
     17       .4555         .5285         .5751
     18       .4438         .5155         .5614
     19       .4329         .5034         .5487
     20       .4227         .4921         .5368
     25       .3809         .4451         .4869
     30       .3494         .4093         .4487
     35       .3246         .3810         .4182
     40       .3044         .3578         .3932
     45       .2875         .3384         .3721
     50       .2732         .3218         .3541
     60       .2500         .2948         .3248
     70       .2319         .2737         .3017
     80       .2172         .2565         .2830
     90       .2050         .2422         .2673
    100       .1946         .2301         .2540

^ This table is printed here through the courtesy of R. A. Fisher and his
publishers, Oliver and Boyd, of Edinburgh. The original appears as Table V.A
of Statistical Methods for Research Workers.


If we are seeking to determine the significance of given
coefficients of correlation with reference to hypothetical
values of zero, use may be made of a table prepared by
R. A. Fisher, showing the values of correlation coefficients
at stated levels of significance. Selected values from this
table are given in Table 139 and in Appendix Table III. In
simple correlation problems, this is to be read with n equal
to N - 2 (the number of pairs of original observations
less 2). In determining the significance of coefficients of
partial correlation the number of variables held constant
is also subtracted from N.

The use of the table requires little explanation. If a
sample is based on 12 pairs of observations, with n equal
to 10, we would require a coefficient at least as high as
.7079 before we accept it as significant, if our standard
of significance is P = .01. For only 1 time out of 100
trials would a sample of 12 drawn from an uncorrelated
population yield a value of r as great as .7079. If our
standard of significance is P = .05 we would accept as
significant of a real relationship an r of .5760, or greater,
obtained from a sample of 12.
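
The connection between the t-test of r and the table of significant
values of r can be made explicit. The sketch below computes t for
the cotton illustration and then inverts the same relation,
$r = t/\sqrt{t^2 + n}$, to recover the coefficients required for
significance at n = 10. The inversion is a simple algebraic
consequence of the formula for t given above, not a description of
how the published table was prepared; the function names are
hypothetical, and the limiting values of t are taken from Table 137.

```python
import math


def t_from_r(r, n_pairs):
    """t for testing a simple correlation r against a true value of
    zero, with n_pairs - 2 degrees of freedom."""
    return r * math.sqrt(n_pairs - 2) / math.sqrt(1 - r * r)


def r_required(t_limit, df):
    """Value of r that yields exactly t_limit with df degrees of freedom."""
    return t_limit / math.sqrt(t_limit ** 2 + df)


# Cotton illustration: r = -.65 from 35 observations
t_cotton = t_from_r(-0.65, 35)
print(f"t = {t_cotton:.2f} with n = 33")

# Twelve pairs of observations, n = 10; limits of t from Table 137
r_01 = r_required(3.169, 10)
r_05 = r_required(2.228, 10)
print(f"r required at P = .01: {r_01:.4f}")
print(f"r required at P = .05: {r_05:.4f}")
```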

TRANSFORMATION OF r TO z

The sampling limitations attaching to r have led R. A. Fisher
to utilize as a general measure of linear correlation a
logarithmic function of r that possesses certain distinctive merits.^
In effecting the transformation we have

$$z = \tfrac{1}{2} \{ \log_e (1 + r) - \log_e (1 - r) \}.$$

Conversely

$$r = \frac{e^{2z} - 1}{e^{2z} + 1}.$$

The scales of possible values of r and z are, of course, quite
different. For r = 0, z = 0, and for r = 1, z = ∞. Negative
values of r give negative values of z. The relations
between the two functions, at different levels of correlation,
are shown by the entries in Appendix Table IV. Transformation
may be more readily effected by means of this
table than from the relations given above.

^ See Statistical Methods for Research Workers, Chapter VI.

There are certain highly important advantages in this
transformation. Not least is the replacing of r by a function
with a distribution of values corresponding more closely
to the true significance of observed correlations than do
those of r. Thus a change in the value of r from .88 to
.98 is equivalent, on the r scale, to a change from .20 to
.30. But the first of these differences represents, on the
z scale, a change from 1.38 to 2.30 (a range of .92) while
the second represents a change in z from .20 to .31 (a
range of .11). The first difference, on the z scale, is over
8 times more significant than the second. In this the z
scale gives a far more accurate representation of the true
significance of observed correlations than does the r scale.

More important than this, however, is the fact that the
distribution of z is much closer to the normal type than is
that of r; in particular, the distribution of z is not subject,
as is that of r, to marked variations in form with variations
in the degree of correlation in the population. The form
of the distribution of z is virtually independent of the
degree of correlation. As a result, the sampling errors to
which z is exposed may be estimated with considerable
accuracy. For the standard error of z we have

$$\sigma_z = \frac{1}{\sqrt{N - 3}}$$

This standard error, it is to be noted, is a function solely
of N. It is independent of the true value of z in the parent
population.

From the example in Chapter XVI we obtained a coefficient
of partial correlation of -.2923 between corn yield
per acre in Kansas and average June temperature, holding
constant effects of changes in July and August temperatures.
Referring to Appendix Table IV we have, for
r = -.2923, z = -.301. In computing the standard
error of a coefficient of partial correlation we must subtract
from N the number of variables held constant. Since N
equals 44 in the example in question, we treat the coefficient
of partial correlation as we would a simple coefficient based
on 42 observations. For the standard error of z we have,
then,

$$\sigma_z = \frac{1}{\sqrt{42 - 3}} = .160.$$

With reference to this result we may determine whether z
differs significantly from zero. For the test we must have

$$\frac{z - 0}{\sigma_z} = \frac{-.301}{.160} = -1.88.$$

We interpret 1.88 as a normal deviate. It is clear that it
is not large enough to indicate that z is significant. The
result is not inconsistent with the hypothesis that the true
value of z (and hence of r) is zero.

If, however, we test the coefficient $r_{14.23} = -.4057$, from
the same example (defining the relation between corn yield
per acre and August temperature, with June and July
temperatures held constant), we have

$$\frac{z - 0}{\sigma_z} = \frac{-.430}{.160} = -2.69.$$

This result is clearly significant. So, also, is the measure
$r_{13.24} = -.6101$, the coefficient of partial correlation
between corn yield and July temperature, with June and
August temperatures held constant.
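
The three tests on the coefficients of partial correlation may be
retraced without recourse to Appendix Table IV, since the
transformation $z = \tfrac{1}{2}\{\log_e(1 + r) - \log_e(1 - r)\}$ is the
inverse hyperbolic tangent of r. The sketch below is an illustration
under that identity; the function name and the dictionary labels are
hypothetical.

```python
import math


def z_test(r, n_obs, held_constant=0):
    """Transform r to z and express z in units of its standard error.
    The number of variables held constant is subtracted from N."""
    z = math.atanh(r)  # equals 0.5 * (log(1 + r) - log(1 - r))
    effective_n = n_obs - held_constant
    se_z = 1 / math.sqrt(effective_n - 3)
    return z, se_z, z / se_z


# Corn yield in Kansas against monthly temperature, N = 44,
# two other temperature variables held constant in each case
partials = {"June": -0.2923, "August": -0.4057, "July": -0.6101}

results = {}
for month, r in partials.items():
    z, se_z, ratio = z_test(r, 44, held_constant=2)
    results[month] = (z, se_z, ratio)
    print(f"{month:<7} r = {r:.4f}  z = {z:.3f}  z / sigma_z = {ratio:.2f}")
```

The ratio for July comes out far larger in absolute value than the
other two, in agreement with the statement that this measure is also
significant.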

The procedure would be similar, of course, if we were 
testing the significance of the deviation of an observed 
value of z from a theoretical value other than zero. 

The transformation to z makes possible, also, an accurate
test of the significance of the difference between two
observed correlations. The standard error of the difference
between two values of z is given by

$$\sigma_{z_1 - z_2} = \sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}$$

where $N_1$ is the number of pairs of observations in the first
sample, $N_2$ the number in the second.

This test may be illustrated with reference to observations
on the timing of price changes during business cycles.
For 111 commodities we have observations on the timing
of price declines in two successive periods of business
recession occurring in the late 90's and early 1900's. The
degree of relation between the time sequences of commodity
price changes in these two recessions is indicated by a
coefficient of correlation of +.22. For two similar (successive)
periods in the 1920's the measure of correlation,
based on the prices of 121 commodities, has a value of
+.36. There appears to have been a closer approach to a
common pattern in the later period than in the earlier.
In testing the significance of the difference between the two
results we set up the hypothesis that the two samples were
drawn from the same parent population, and that therefore
the true value of the difference between the two coefficients
is zero.^

For the two samples we have

    $r = .22$;  $z = .223$;  $\frac{1}{N_1 - 3} = .0093$
    $r = .36$;  $z = .377$;  $\frac{1}{N_2 - 3} = .0085$.

The difference to be tested is

$$D_z = .377 - .223 = .154.$$

The standard error of this difference is

$$\sigma_{z_1 - z_2} = \sqrt{.0093 + .0085} = .133.$$

We wish to know whether $D_z$ is significantly different from
zero. We compute, therefore,

$$\frac{D_z - 0}{\sigma_{z_1 - z_2}} = \frac{.154 - 0}{.133} = 1.16.$$

Interpreting 1.16 as a normal deviate, we conclude that
the difference is not significant. $D_z$ differs from the hypothetical
value of zero by only slightly more than one standard
deviation. The results are not inconsistent with the
hypothesis that the two samples are drawings from the
same parent population. There is here no clear evidence
that the degree of relationship between price movements
in successive cycles was closer in the 1920's than in the
earlier period.
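
The test of the difference between the two correlations may be
reproduced as follows. This is a sketch of the computation just
described, with z obtained from the inverse hyperbolic tangent rather
than from Appendix Table IV; the function name is hypothetical, and
unrounded intermediate values make the final ratio differ from the
printed figure in the second decimal.

```python
import math


def z_difference_test(r1, n1, r2, n2):
    """Difference between two transformed correlations, its standard
    error, and the ratio of the two."""
    d_z = math.atanh(r2) - math.atanh(r1)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return d_z, se, d_z / se


# Earlier recessions: r = +.22 on 111 commodities
# Recessions of the 1920's: r = +.36 on 121 commodities
d_z, se, ratio = z_difference_test(0.22, 111, 0.36, 121)
print(f"D_z = {d_z:.3f}, standard error = {se:.3f}, ratio = {ratio:.2f}")
print("exceeds 1.96 (P = .05, normal deviate):", abs(ratio) > 1.96)
```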

^ The time factor enters to cloud statistical inductions relating to samples
drawn from different periods (see above, Chapter XIV). Such an induction
should be supported by evidence indicating that fundamental conditions in the
field in question have not been altered over the time interval involved. This
caution does not, of course, affect the procedure illustrated above.

Finally, making use of the z-transformation, we may
combine results secured from the measurement of correlation
in different samples. If we have two values of r,
obtained from samples drawn from the same population,
a weighted average of the two will provide a better estimate
of the true correlation than will either of the r's, taken
separately. For the averaging process we transform the
r's to z's, weight each z by the corresponding N, less 3,
and average them. Then, if desirable, the corresponding
value of r may be determined. We may note that the
standard error of the weighted average of the two z's is
given by

$$\sigma_{\bar{z}} = \frac{1}{\sqrt{(N_1 - 3) + (N_2 - 3)}}$$
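
The averaging process may be sketched as follows. The two
correlations used here (+.22 on 111 observations and +.36 on 121)
are borrowed from the preceding illustration purely to supply
numbers; treating them as drawings from one population is an
assumption made for the sake of the example, and the function name
is hypothetical.

```python
import math


def combine_correlations(samples):
    """samples is a list of (r, N) pairs. Each r is transformed to z,
    weighted by N - 3, and averaged; the average is then converted
    back to the r scale."""
    weights = [n - 3 for _, n in samples]
    zs = [math.atanh(r) for r, _ in samples]
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    se_z_bar = 1 / math.sqrt(sum(weights))
    return z_bar, se_z_bar, math.tanh(z_bar)


z_bar, se_z_bar, r_bar = combine_correlations([(0.22, 111), (0.36, 121)])
print(f"weighted average z = {z_bar:.3f} (standard error {se_z_bar:.3f})")
print(f"corresponding r = {r_bar:.3f}")
```

The combined r necessarily lies between the two sample values, nearer
to the one based on the larger sample.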


The Chi-Square Test 

One of the great contributions of Karl Pearson to statisti- 
cal methodology was the determination of the form of 
the distribution of Chi-square, and the development of 
methods of utilizing this distribution. The character of 
this distribution and various tests based on it are our 
concern in the present section. 

THE NATURE OF CHI-SQUARE AND ITS DISTRIBUTION 

The quantity Chi-square (represented always by the
symbol $\chi^2$) is a measure of the degree to which a series of
observed frequencies deviate from corresponding theoretical
or hypothetical frequencies. The theoretical frequencies
are set up on the basis of some hypothesis, some rational
argument. The magnitude of the discrepancy between
theory and observation is defined by the quantity $\chi^2$. It
was Pearson's contribution to determine the nature of the
distribution of the values of $\chi^2$ that would be obtained
under given sampling conditions. Knowledge of this
distribution enables us to determine whether a given discrepancy
between theory and observation may be attributed
to chance, or whether it results from the inadequacy of
the theory to fit the observed facts. This instrument is
obviously one of extreme importance in statistical analysis.

The character of the distribution of $\chi^2$ may be discussed
with reference to Weldon's data relating to the results
obtained in 4,096 throws of 12 dice (see page 433). We
call a 4, 5, or 6 spot a success, a 1, 2, or 3 spot a failure.
When 12 dice are thrown the expected (or theoretical)
number of successes on each throw is 6. A deviation from
6 represents a discrepancy between expectation and observation.
From the result of each throw of 12 dice a value of $\chi^2$
may be computed. Thus, a given throw yields 2 successes
and 10 failures. The 2 successes represent a deviation of 4
from the expected value of 6; the 10 failures represent a
deviation of 4 from the expected value of 6. (In such an
experiment as this there are two components of each value
of $\chi^2$ even though when one component is given the other
is necessarily determined. For the sum of successes and
failures must be 12 on each throw.) The value of $\chi^2$ in
a given instance is obtained by squaring the discrepancies
between expectation and observation, dividing the squared
values by the corresponding expected values, and adding
the quantities thus obtained. That is

$$\chi^2 = \Sigma \frac{(f_0 - f)^2}{f}$$

where $f_0$ denotes an observed frequency and $f$ defines the
corresponding theoretical frequency.

In the case cited above we have

$$\chi^2 = \frac{(2 - 6)^2}{6} + \frac{(10 - 6)^2}{6} = 5.333.$$

On another trial, with 7 successes and 5 failures, we have

$$\chi^2 = \frac{(7 - 6)^2}{6} + \frac{(5 - 6)^2}{6} = .333.$$

On still another trial, giving 6 successes and 6 failures, we
have

$$\chi^2 = \frac{(6 - 6)^2}{6} + \frac{(6 - 6)^2}{6} = 0.$$

The 4,096 throws thus yield 4,096 values of $\chi^2$. Tabulating
these with respect to the frequency of occurrence of stated
values, we obtain the distribution given in Table 140 on
page 620.
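
The computation of $\chi^2$ for a single throw may be sketched as
follows. The function name is hypothetical; the three throws are
those cited in the text, and the final line lists the values that
$\chi^2$ can take for each possible number of successes in a throw of
12 dice.

```python
def chi_square(observed, expected):
    """Sum of squared discrepancies, each divided by the
    corresponding expected frequency."""
    return sum((fo - f) ** 2 / f for fo, f in zip(observed, expected))


expected = (6, 6)  # expected successes and failures in a throw of 12 dice

for successes in (2, 7, 6):
    observed = (successes, 12 - successes)
    print(f"{successes:>2} successes: chi-square = "
          f"{chi_square(observed, expected):.3f}")

# Every possible value of chi-square in this experiment
possible = sorted({round(chi_square((k, 12 - k), expected), 3) for k in range(13)})
print("possible values:", possible)
```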

Table 140

Tabulation of 4,096 Observed Values of $\chi^2$

    Value ^ of $\chi^2$                  Frequency of     Frequency of
    (measuring deviation of          occurrence       occurrence
    observation from expectancy      (absolute)       (relative)
    in dice-throwing experiment)

    0 to .833                          2,526            .6167
    .833 to 2.167                        966            .2358
    2.167 to 4.167                       455            .1111
    4.167 to 6.667                       131            .0320
    Over 6.667                            18            .0044

    Total                              4,096           1.0000

This table gives us information as to the nature of the
discrepancies between theoretical norms and actual results
that chance may bring about. For deviations from the
expected frequency of successes, 6, may be attributed to
the mass of undifferentiated causes we call chance. The
magnitude of $\chi^2$ varies, of course, with the degree of deviation.
Values of $\chi^2$ not exceeding .833 are most frequent.
Higher values of $\chi^2$ occur with decreasing frequency. Only
18 out of 4,096 observed values of $\chi^2$ exceed 6.667. This
distribution furnishes us, therefore, with a standard of
reference to employ when seeking to determine whether
a given discrepancy between theoretical and observed values
is attributable to chance, or whether it is too great to be
so explained.

This use of the table, as a standard for determining
the probability that given discrepancies between theory
and observation are attributable to the play of chance,
is facilitated by a somewhat different arrangement. We
may set up a table of cumulative values, based upon the
tabulation of the 4,096 values of $\chi^2$ obtained in the
preceding experiment. These are given in Table 141.

¹ The 4,096 values of $\chi^2$ tabulated here constitute a discrete
series. The conditions of the experiment are such that the 4,096
observations on $\chi^2$ are distributed among only six values,
ranging from 0 to 8.333. In order that the observed frequencies of
occurrence of stated values of $\chi^2$ may be compared (in a later
table) with theoretical frequencies, an uneven class-interval is
employed above. Class limits are taken midway between successive
values at which the actual observations fall. (The decimal fractions
used in the table do not define these limits with full accuracy.)

The entries in col. (2) of this table indicate that in the
experiment involving 4,096 throws of dice, a value of $\chi^2$
of 6.667 or more occurs less frequently than 1 time out
of 100 (only 44 times out of 10,000, in fact). A value as
great as 4.167, however, occurred more frequently than
3 times out of 100. If we interpret these relative frequencies
as probabilities, we may obtain from such a table a knowledge
of the probabilities corresponding to stated values
of $\chi^2$. Here is the instrument we desire, in seeking to
determine whether given observations conform closely enough
with expectations based on theory, or on working hypotheses
which perhaps are not yet ready to be dignified as theories.


Table 141

Cumulative Relative Frequencies of Occurrence of 4,096 Observed
Values of $\chi^2$; Corresponding Theoretical Frequencies *

(1)                         (2)                   (3)
Value of $\chi^2$           Relative frequency    Relative frequency
(cumulative deviation       of occurrence         of occurrence
of observation              (observed)            (theoretical)
from expectancy)

0 or more                     1.0000                1.0000
.833 or more                   .3833                 .3613
2.167 or more                  .1475                 .1411
4.167 or more                  .0364                 .0412
6.667 or more                  .0044                 .0098


We should note two important limitations attaching to
the entries in col. (2) of the above table, showing relative
frequencies corresponding to stated values of $\chi^2$. In the
first place, these are merely empirical results, obtained
from a given set of experiments. The conditions of the
experiment yield a discontinuous series of values for $\chi^2$.
In some degree, this discontinuity has been ironed out by
the method of classification employed, but the instrument
derived from this single experiment remains an imperfect
one. The effects of chance fluctuations are present in
these results, also, and contribute to the imperfection of
the instrument. The true distribution of $\chi^2$ is only
approximated by the results presented in col. (2) of Table 141.

* One degree of freedom is present in the determination of a single
value of $\chi^2$ in this example.

The entries in col. (3) of Table 141 are free of this
limitation. These record the frequencies with which values
of $\chi^2$ falling within the limits indicated in col. (1) might
be expected to occur, on the basis of mathematical theory,
under the conditions of the present experiment.¹ These are
the entries which provide the standard we desire, in
determining the significance of a given series of discrepancies
between observation and expectation. It is to be noted,
however, that the empirically derived table constitutes a
fair approximation to the theoretical distribution of $\chi^2$
under these conditions.
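The theoretical entries of col. (3) may be checked with a short computation. The sketch below is offered only as an illustration, in Python; the function name `chi2_tail_one_df` is hypothetical, and the calculation rests on the standard result that, with one degree of freedom, $\chi^2$ is the square of a normal deviate, so that the probability of a value as great as $x$ or greater is erfc(√(x/2)).

```python
import math

def chi2_tail_one_df(x):
    """P(chi-square >= x) for one degree of freedom.

    With one degree of freedom, chi-square is the square of a standard
    normal deviate, so the upper tail equals erfc(sqrt(x / 2)).
    """
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))

# Class limits from col. (1) of Table 141.
limits = [0.0, 0.833, 2.167, 4.167, 6.667]
theoretical = [chi2_tail_one_df(x) for x in limits]

for x, p in zip(limits, theoretical):
    print(f"{x:6.3f} or more: {p:.4f}")
```

The printed figures should agree with col. (3) of Table 141 to within the rounding of the class limits.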

The second limitation attaching to the example cited
above is that each of the 4,096 values of $\chi^2$ tabulated has
two components, and that the experiment is such that
when one component is given the second is necessarily
determined. (Since there are 12 events in each throw we
know, for example, that if we have 8 successes there must
be 4 failures.) This condition is described by saying that
there is but one degree of freedom in the derivation of
a given value of $\chi^2$. The table we have obtained relates,
therefore, to a special case: the distribution of values
of $\chi^2$ computed with one degree of freedom. There are
other possible cases. For each of these the distribution of
$\chi^2$ may be determined in a manner similar to that shown above.

¹ These relative frequencies are taken from G. Udny Yule, “Table of
the values of P for divergence from independence in the fourfold
table,” Journal of the Royal Statistical Society, Vol. LXXXV,
January, 1922, 103-104.

As an example of a different set of conditions we may
consider the outcome of a throw of 24 dice, account being
kept of the frequency of occurrence of each possible result
(i.e., the appearance of a 1, 2, 3, 4, 5, or 6 spot). When 24
dice are thrown there may be expected 4 one spots, 4
two spots, 4 three spots, etc. In a given throw we obtain
the following results:

Number of spots         1    2    3    4    5    6
Observed frequency      2    5    6    4    4    3
Expected frequency      4    4    4    4    4    4


For the results of this throw the value of Chi-square would
be given by

$$\chi^2 = \frac{(2-4)^2}{4} + \frac{(5-4)^2}{4} + \frac{(6-4)^2}{4} + \frac{(4-4)^2}{4} + \frac{(4-4)^2}{4} + \frac{(3-4)^2}{4} = 2.5.$$

This quantity has six components. However, as soon as
five are given the sixth is determined, since the total number
of events is fixed at twenty-four. There are, then, five
degrees of freedom in the calculation of $\chi^2$ in this experiment.
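The computation for this throw may be sketched as follows. The function name `chi_square` is hypothetical and is introduced here only for illustration; the observed and expected frequencies are those of the throw of 24 dice described above.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have equal length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Frequencies of 1, 2, ..., 6 spots in one throw of 24 dice.
observed = [2, 5, 6, 4, 4, 3]
expected = [4] * 6

chi2 = chi_square(observed, expected)

# The total is fixed at 24, so one of the six components is determined
# by the other five.
degrees_of_freedom = len(observed) - 1

print(chi2, degrees_of_freedom)
```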

If the 24 dice were thrown a thousand times, say, we
should have one thousand values of $\chi^2$. A distribution
of these could be constructed, similar to that derived
empirically for the case in which there was one degree
of freedom. It would be a different distribution, however,
for the change in degrees of freedom has an obvious relation
to the magnitude of $\chi^2$. The character of the distribution
of the values of $\chi^2$ that would be obtained in such an
experiment is indicated by the entries in Table 142.
We do not here give empirical values, as in the preceding
example. The table shows the theoretical frequencies with
which given values of $\chi^2$ occur, when five degrees of freedom
prevail.

In using tables of this sort we may interpret measures
of relative frequency as probabilities. Thus we may
read Table 142, which relates to the distribution of $\chi^2$
computed with five degrees of freedom, as follows: If
the true value of $\chi^2$ is zero (i.e., in an infinitely large
sample observed frequencies would agree precisely with the
theoretical frequencies we have set up), the probability of our
securing a $\chi^2$ of zero or more, from a sample of the type
here employed, is 1.00; the probability of our securing
a $\chi^2$ of 1.00 or more is 9,626/10,000; the probability of
our securing a $\chi^2$ of 3.00 or more is 7/10; the probability
of our securing a $\chi^2$ infinitely large is 0. The quantities
$\chi^2$ and $P$ stand, thus, in a definite functional relationship,
for any given value of $n$ ($n$ denotes the number of degrees
of freedom). At the two limits the relationships are the
same for all values of $n$. When $\chi^2 = 0$, $P = 1.00$; when
$\chi^2 = \infty$, $P = 0.00$. But for intermediate values the
relationship varies with $n$.

Table 142

Tabulation of $\chi^2$ Computed with Five Degrees of Freedom, with
Cumulative Relative Frequencies ¹

Value of $\chi^2$      Relative frequency
                       of occurrence
                       (theoretical)

 0 or more               1.0000
 1 or more                .9626
 2 or more                .8491
 3 or more                .7000
 4 or more                .5494
 5 or more                .4159
 6 or more                .3062
 7 or more                .2206
 8 or more                .1562
 9 or more                .1091
10 or more                .0752
11 or more                .0514
12 or more                .0348
13 or more                .0234
14 or more                .0156
15 or more                .0104
16 or more                .0068
30 or more                .000015
$\infty$                  .000000

¹ From the table prepared by W. P. Elderton and given in Tables for
Statisticians and Biometricians, Karl Pearson, editor, 26. The n' of
Elderton's table is equal to n + 1, for an example of the type here
given.

In 1900 Karl Pearson defined the distribution function
of $\chi^2$.¹ The actual application of the $\chi^2$ test is
facilitated by prepared tables. Selected entries from these
tabulations, for different values of $n$, are given in Table 143
and in Appendix Table V.

Table 143 ¹

Table of $\chi^2$ for Selected Values of $P$ and $n$

 n    P = .99      .95       .50       .10       .05       .02       .01

 1    .000157    .00393      .455     2.706     3.841     5.412     6.635
 2    .0201      .103       1.386     4.605     5.991     7.824     9.210
 3    .115       .352       2.366     6.251     7.815     9.837    11.341
 4    .297       .711       3.357     7.779     9.488    11.668    13.277
 5    .554      1.145       4.351     9.236    11.070    13.388    15.086
 6    .872      1.635       5.348    10.645    12.592    15.033    16.812
 7   1.239      2.167       6.346    12.017    14.067    16.622    18.475
 8   1.646      2.733       7.344    13.362    15.507    18.168    20.090
 9   2.088      3.325       8.343    14.684    16.919    19.679    21.666
10   2.558      3.940       9.342    15.987    18.307    21.161    23.209
11   3.053      4.575      10.341    17.275    19.675    22.618    24.725
12   3.571      5.226      11.340    18.549    21.026    24.054    26.217
13   4.107      5.892      12.340    19.812    22.362    25.472    27.688
14   4.660      6.571      13.339    21.064    23.685    26.873    29.141
15   5.229      7.261      14.339    22.307    24.996    28.259    30.578
16   5.812      7.962      15.338    23.542    26.296    29.633    32.000
17   6.408      8.672      16.338    24.769    27.587    30.995    33.409
18   7.015      9.390      17.338    25.989    28.869    32.346    34.805
19   7.633     10.117      18.338    27.204    30.144    33.687    36.191
20   8.260     10.851      19.337    28.412    31.410    35.020    37.566
21   8.897     11.591      20.337    29.615    32.671    36.343    38.932
22   9.542     12.338      21.337    30.813    33.924    37.659    40.289
23  10.196     13.091      22.337    32.007    35.172    38.968    41.638
24  10.856     13.848      23.337    33.196    36.415    40.270    42.980
25  11.524     14.611      24.337    34.382    37.652    41.566    44.314
26  12.198     15.379      25.336    35.563    38.885    42.856    45.642
27  12.879     16.151      26.336    36.741    40.113    44.140    46.963
28  13.565     16.928      27.336    37.916    41.337    45.419    48.278
29  14.256     17.708      28.336    39.087    42.557    46.693    49.588
30  14.953     18.493      29.336    40.256    43.773    47.962    50.892

¹ This table is reproduced here through the courtesy of R. A. Fisher
and his publishers, Oliver and Boyd, of Edinburgh. The entries are
taken from Table III of Statistical Methods for Research Workers.
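The functional relationship between $\chi^2$, $P$, and $n$ may be illustrated by direct computation. The sketch below is not the method by which the printed tables were prepared; it assumes the standard finite-series expressions for the upper tail of the $\chi^2$ distribution when $n$ is a whole number, and the function name `chi2_upper_tail` is hypothetical.

```python
import math

def chi2_upper_tail(x, n):
    """P(chi-square >= x) for a whole number n >= 1 of degrees of freedom."""
    if n < 1:
        raise ValueError("n must be a positive whole number")
    if x <= 0:
        return 1.0
    if n % 2 == 0:
        # Even n: exp(-x/2) times the first n/2 terms of the series
        # for exp(x/2).
        term, total = 1.0, 1.0
        for r in range(1, n // 2):
            term *= (x / 2.0) / r
            total += term
        return math.exp(-x / 2.0) * total
    # Odd n: two-sided normal tail beyond sqrt(x), plus a finite series
    # x^(1/2)/1 + x^(3/2)/(1*3) + x^(5/2)/(1*3*5) + ...
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2.0))
    density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    term, total = root, 0.0
    for r in range(1, (n - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return tail + 2.0 * density * total

# A few entries of Table 142 (five degrees of freedom).
table_142 = {x: chi2_upper_tail(x, 5) for x in (1, 2, 3, 10, 16)}

# P for three of the values listed under P = .05 in Table 143.
five_per_cent = [chi2_upper_tail(x, n)
                 for n, x in ((1, 3.841), (5, 11.070), (14, 23.685))]

for x, p in table_142.items():
    print(f"{x:2d} or more: {p:.4f}")
print([round(p, 3) for p in five_per_cent])
```

Each of the last three figures should lie close to .05, which is a check on the corresponding entries of Table 143.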

For determining the significance of $\chi^2$ beyond the range
of this table, Fisher has given $\sqrt{2\chi^2} - \sqrt{2n - 1}$ as
a value which may be interpreted as a normal deviate. That is,
the figure derived when stated values of $\chi^2$ and $n$ are
inserted in the above expression is to be taken as a deviation
from the mean of a normal distribution, expressed in units of
the standard deviation of that distribution. The corresponding
value of $P$ is then derived from a table of areas
under the normal curve.
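A minimal sketch of this approximation follows. The function names are hypothetical, the area under the normal curve is obtained from the complementary error function rather than from a printed table, and the values $\chi^2 = 60$, $n = 40$ are invented purely for illustration. The last row of Table 143 ($n = 30$, $\chi^2 = 43.773$, $P = .05$) is used as a rough check.

```python
import math

def fisher_normal_deviate(chi2, n):
    """Fisher's approximation: sqrt(2 * chi2) - sqrt(2n - 1)."""
    return math.sqrt(2.0 * chi2) - math.sqrt(2.0 * n - 1.0)

def upper_tail_normal(z):
    """Area under the standard normal curve above the deviate z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical case beyond the range of Table 143.
z = fisher_normal_deviate(60.0, 40)
p = upper_tail_normal(z)

# Rough check against the tabled 5 per cent value for n = 30.
z_30 = fisher_normal_deviate(43.773, 30)
p_30 = upper_tail_normal(z_30)

print(round(z, 3), round(p, 4))
print(round(z_30, 3), round(p_30, 4))
```

The check value should fall near, though not exactly at, .05, since the expression is an approximation.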

The $\chi^2$ test is applicable to a considerable variety of
problems. Wherever, on rational grounds, a set of theoretical
frequencies may be derived, for comparison with observed
frequencies, this test is appropriate in judging of the
significance of the discrepancy between the two sets of
frequencies. In the following pages three applications of
this test are exemplified.

¹ Cf. “On the Criterion that a Given System of Deviations from the
Probable in the Case of a Correlated System of Variables is such that
it can be Reasonably Supposed to have Arisen from Random Sampling,”
Philosophical Magazine, 5th Series, Vol. L, 1900.

THE CHI-SQUARE TEST OF GOODNESS OF FIT

When an ideal frequency curve, whether normal or of
some other type, is fitted to an actual frequency distribution,
theory and observation are being compared. A test of the
concordance of the two (i.e., of goodness of fit) may be
made by inspection, but such a test is obviously inadequate.
Precision may be secured by employing the $\chi^2$ test. The
example in Table 144, relating to the distribution of telephone
subscribers discussed in Chapter XIII, illustrates the
procedure.

Table 144

Computation of $\chi^2$ for Testing Goodness of Fit
Normal Curve of Error Fitted to Distribution of Telephone Subscribers

(1)              (2)          (3)            (4)           (5)
Class            Observed     Theoretical    $f_o - f$     $(f_o - f)^2 / f$
limits           frequency    frequency
                 $f_o$        $f$

150 and less        10          13.14        -  3.14          .75
150-200             19          16.76        +  2.24          .30
200-250             38          31.57        +  6.43         1.31
250-300             50          53.02        -  3.02          .17
300-350             95          79.43        + 15.57         3.05
350-400             85         106.10        - 21.10         4.20
400-450            115         126.41        - 11.41         1.03
450-500            132         134.31        -  2.31          .04
500-550            144         125.50        + 18.50         2.73
550-600            116         106.51        +  9.49          .85
600-650             79          81.85        -  2.85          .10
650-700             54          55.21        -  1.21          .03
700-750             31          33.19        -  2.19          .14
750-800             11          17.81        -  6.81         2.60
More than 800       16          14.19        +  1.81          .23

Total              995         995.00        15 groups     $\chi^2$ = 17.53

There are 15 classes in this distribution. Since the total
theoretical frequencies must equal the total observed
frequencies, the entry in the fifteenth class is fixed when the
other 14 are established. The given value of $\chi^2$, 17.53,
is determined, therefore, with 14 degrees of freedom. From
Table 143 we see that when $n = 14$ a value of $\chi^2$ as great
as 23.685, or greater, would occur purely as a result of
chance in 5 out of 100 random samples, if the true value
of $\chi^2$ were zero. The value of 17.53 secured above is not
excessively high, therefore. The discrepancies between the
observed and theoretical frequencies in Table 144 could
easily have arisen as a result of chance. The fit obtained
with the normal curve is acceptable. Which is to say that
our results are not inconsistent with the hypothesis that
the normal law of error defines the distribution of residence
telephone subscribers, classified on the basis of message use.
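The arithmetic of Table 144 may be reproduced in a few lines. This is a sketch only: the frequencies are those of cols. (2) and (3), the count of 14 degrees of freedom follows the reasoning of the text, the 5 per cent value 23.685 is the entry of Table 143 for $n = 14$, and the variable names are hypothetical.

```python
# Cols. (2) and (3) of Table 144.
observed = [10, 19, 38, 50, 95, 85, 115, 132, 144, 116, 79, 54, 31, 11, 16]
theoretical = [13.14, 16.76, 31.57, 53.02, 79.43, 106.10, 126.41, 134.31,
               125.50, 106.51, 81.85, 55.21, 33.19, 17.81, 14.19]

# Col. (5): one component of chi-square for each class.
components = [(fo - f) ** 2 / f for fo, f in zip(observed, theoretical)]
chi2 = sum(components)

# The totals must agree, so one class is fixed by the other fourteen.
n = len(observed) - 1

# Entry of Table 143 for n = 14, P = .05.
five_per_cent_value = 23.685
fit_acceptable = chi2 < five_per_cent_value

print(round(chi2, 2), n, fit_acceptable)
```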
In applying the Chi-square test it is not necessary to
determine the exact probability corresponding to a stated
value of $\chi^2$. Our purpose, in general, is to ascertain whether
observed results are or are not consistent with the hypothesis
on which the fitting procedure is based. For this purpose
we wish only to know whether the value of $P$ corresponding
to a given value of $\chi^2$ falls below (or, much more rarely,
above) certain critical values. As a conventional limit .05
is usually employed. If a value of $\chi^2$ is such that $P$ is below
.05, the discrepancies between observed and theoretical
values are, on this standard, considered too great to be
attributed to chance. The hypothesis on the basis of which
the theoretical frequencies have been determined is suspect,
in such a case. If $\chi^2$ is large enough to give values of $P$
below .02 or .01, the inadequacy of the hypothesis is,
of course, more strongly indicated.

R. A. Fisher points out that suspicion should attach
to very low values of $\chi^2$, which give values of $P$ of .99
or thereabout. These values indicate a very close agreement
between the hypothesis and the observed facts. Such close
agreement may be due to chance, but there is strong probability
that the hypothesis is at fault or, in mathematical
terms, that the wrong function is being used. Coincidence
of observed and theoretical values suggests the kind of
agreement one obtains by fitting to $n$ points a curve in
the equation to which there are $n$ constants. Any artificial
forcing of agreement between hypothesis and observation
of course invalidates the application of the Chi-square test.

In applying the Chi-square test it is convenient to use 
the conventional standards we have noted, as guides to 
the rejection or provisional acceptance of working hypothe- 
ses. It is unwise to use these standards arbitrarily, however. 
No single standard possesses significance in any absolute 
sense. The investigator in a given field of research will 
interpret the information such a test yields in the light of 
other knowledge relating to that field of experience, and with 
due regard to the rational foundation of the hypotheses
being tested. 

One feature of Table 144 requires explanation. It will
be noted that in the construction of this table the three
classes at the lower end of the distribution have been lumped
into one, and that the same thing has been done with the
six classes at the upper end of the distribution (Cf. Tables
109 and 144). This is done to avoid the undue magnification
of slight differences between the tails of the observed and
theoretical distributions. When $f$, the theoretical frequency,
is very small, a relatively slight absolute discrepancy between
$f_o$ and $f$ may serve to swell materially the value of $\chi^2$.
The lumping process is designed to prevent such a distortion.
Since the selection of classes for combination rests on the
personal judgment of the investigator, a subjective element is
necessarily introduced here. However, the results of the
test will not usually be much affected by minor variations
in the combination of tail-end classes.¹

¹ Considerations of the same sort suggest that a sample of reasonable
size is needed for the valid application of the Chi-square test in
curve fitting. Deming and Birge set 500 observations as the minimum
required in a test of this type, if confidence is to be placed in the
result. Yule and Kendall suggest a smaller number, but place emphasis
on the need of an adequate number of theoretical observations
(preferably not less than 10) in every class.

The use of $\chi^2$ in testing the fit of theoretical frequency
curves is subject to another rather important limitation.
In the computation of $\chi^2$ no account is taken of the
distribution of discrepancies between $f_o$ and $f$. Yet the manner
in which these discrepancies are distributed may materially
influence our judgment as to the goodness of a given fit.
In such an example as that given in Table 144, the successive
values of $f_o - f$, counting from the lower limit of the $x$-scale,
might be alternately positive and negative. Something
approaching this alternation would be expected if chance
factors alone accounted for the differences between observed
and theoretical frequencies. But the differences might be
distributed otherwise. All the values of $f_o - f$ below the
mode might be positive, while all the values above the
mode might be negative. The cumulated discrepancies,
as measured by $\chi^2$, might be equal in the two cases, yet
far more confidence would attach to a fit marked by alternations
of plus and minus deviations than to one in which
a series of positive deviations were bunched together on
the scale, and negative discrepancies were correspondingly
clustered. This limitation serves as a warning against
purely mechanical use of the $\chi^2$ test. Examination of the
fit, and interpretation of $\chi^2$ in the light of the actual
distribution of discrepancies, are required in the application of
this test.
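One simple way of examining the arrangement of the discrepancies is to count the unbroken runs of like sign in the successive values of $f_o - f$. This device is not part of the procedure described in the text; it is offered only as a supplementary illustration, with hypothetical variable names, applied to the figures of Table 144. Many short runs indicate alternation of plus and minus deviations; a few long runs indicate bunching.

```python
# Cols. (2) and (3) of Table 144, in order along the scale.
observed = [10, 19, 38, 50, 95, 85, 115, 132, 144, 116, 79, 54, 31, 11, 16]
theoretical = [13.14, 16.76, 31.57, 53.02, 79.43, 106.10, 126.41, 134.31,
               125.50, 106.51, 81.85, 55.21, 33.19, 17.81, 14.19]

# Sign of each deviation (no deviation in this table is exactly zero).
signs = ["+" if fo > f else "-" for fo, f in zip(observed, theoretical)]

# A new run begins wherever the sign differs from the preceding one.
runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)

print("".join(signs), runs)
```

With two signs bunched entirely on either side of the mode there would be only 2 runs; perfect alternation among 15 classes would give 15.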

THE CHI-SQUARE TEST OF INDEPENDENCE OF PRINCIPLES OF
CLASSIFICATION ¹

A question that frequently arises in research has to do 
with the relation between two principles of classification. 
Thus, in studying commodity price movements during 
revivals after business depressions, we may divide all com- 
modities into durable and non-durable classes. We may 
again divide them into those the prices of which precede 
the general average of commodity prices in the revival, 
and those that lag behind the general index. If the quality 
of durability has no relation to the timing of price recovery, 
the two principles of classification are independent. How- 
ever, certain considerations relating to the character of 
demand for durable and non-durable goods lead us to 
beheve that the durability of a good is related to the 
behavior of its market price during a period of business 
revival. It is possible to apply an objective test to determine 
whether these principles of classification are, in fact, related. 

Observed frequencies are recorded in Table 145.²

Table 145

Observation

Two-fold Classification of 208 Commodities

                             Observed frequencies
Commodity group        Number           Number
                       preceding        lagging behind      Total
                       general index    general index
                       on price rise    on price rise

Durable goods               6                61               67
Non-durable goods          50                91              141
Total                      56               152              208

¹ For a discussion of tests of independence and homogeneity see
Chapter IV of Statistical Methods for Research Workers, by R. A.
Fisher.

² Data from The Behavior of Prices, National Bureau of Economic
Research, New York, 1927, with later additions.


The nature of the durability classification requires no
explanation. The classification relating to the timing of
price changes in business revival is based on the average
behavior of each of the 208 commodities during 13 periods
of business revival occurring between 1890 and 1936. The
process of cross-classification gives four “cells” among
which the 208 commodities are divided in the manner
indicated in the table.

With the observed frequencies that constitute the entries 
in these four cells we may compare a set of theoretical 
frequencies, derived from the hypothesis that the durability 
of economic goods has no relation to the timing of price 
advances after business depressions. These expected fre- 
quencies are computed readily from the sub-totals. The 
67 durable goods constitute 32.21 per cent of all the com- 
modities, while the 141 non-durable goods constitute 67.79 
per cent of the total. If durability has no relation to the
timing of price advance, after depression, we should expect
the 56 commodities that preceded the general index to
be divided between durable and non-durable goods in this
same proportion. That is, 32.21 per cent of the 56
commodities, or 18.04, should be durable, while 67.79 per
cent of the 56, or 37.96, should be non-durable. Similarly,
the 152 commodities lagging behind the general index in
price revival should be divided between the durable and
non-durable categories in exactly the same way, 32.21 per
cent in the durable class, 67.79 per cent in the non-durable.
These expected frequencies, which conform to our hypothesis
that the two principles of classification are independent,
are given in Table 146.


Table 146

Expectation

Two-fold Classification of 208 Commodities

                             Expected frequencies
Commodity group        Number           Number
                       preceding        lagging behind      Total
                       general index    general index
                       on price rise    on price rise

Durable goods             18.04             48.96              67
Non-durable goods         37.96            103.04             141
Total                     56.00            152.00             208


Chi-square is computed from the relation
$\chi^2 = \Sigma\left[\frac{(f_o - f)^2}{f}\right]$
in the following manner:

$$\chi^2 = \frac{(6 - 18.04)^2}{18.04} + \frac{(50 - 37.96)^2}{37.96} + \frac{(61 - 48.96)^2}{48.96} + \frac{(91 - 103.04)^2}{103.04} = 16.222.$$


There are four components of Chi-square in this instance, 
but, as may readily be seen by reference to the table of 
expected frequencies, only one degree of freedom enters 
into its computation. The expected frequencies must yield 
the four group totals, 56, 152, 67, and 141. Accordingly, 
as soon as we fill one of the four cells set up by the process 
of cross-classification, the other three are definitely deter- 
mined. Given 18.04, the expected number of durable 
goods preceding the general index in price revival, the 
entries in the other cells are fixed. Subtraction of 18.04 
from 56 and 67 will fill two of them, and the filling of these 
determines the fourth. 
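The derivation of the expected frequencies from the group totals, and the computation of Chi-square, may be sketched as follows. The variable names are hypothetical. Because the sketch carries the expected frequencies unrounded, it yields a value of about 16.22, which differs slightly in the third decimal place from the 16.222 obtained above with expected frequencies rounded to two decimals.

```python
# Observed frequencies of Table 145.
# Rows: durable, non-durable. Columns: preceding, lagging.
observed = [[6, 61],
            [50, 91]]

row_totals = [sum(row) for row in observed]        # 67, 141
col_totals = [sum(col) for col in zip(*observed)]  # 56, 152
grand_total = sum(row_totals)                      # 208

# Under independence each cell is (row total * column total) / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))

# Once one cell is filled, the four group totals fix the other three.
degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)

print([[round(e, 2) for e in row] for row in expected])
print(round(chi2, 2), degrees_of_freedom)
```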
For the interpretation of the given value of Chi-square
we turn to Table 143, which is to be read with $n$, the number
of degrees of freedom, equal to 1. If durability of economic
goods has no relation to the timing of their price changes
in revival, the two principles of classification employed
above are independent and the true value of Chi-square
is zero. Are the observed results consistent with this
hypothesis? The entries in Table 143 indicate that if the
true value were zero, a value as great as 3.841 would
occur 5 times out of 100, as a result of chance fluctuations.
A value as great as 6.635 would occur only 1 time out of
100. The present value of Chi-square, 16.222, represents
a still smaller probability. The results are not consistent
with the hypothesis we have set up. The differences between
the observed and expected frequencies are too great to be
attributed to the play of chance. Durability, and factors
of demand and supply related thereto, appear to play a
definite role in the timing of price advances in business
revivals.
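How much smaller this probability is may be gauged by a direct computation. The sketch assumes the standard result that, for one degree of freedom, the probability of a $\chi^2$ as great as $x$ or greater is erfc(√(x/2)); the function name is hypothetical.

```python
import math

def chi2_tail_one_df(x):
    """P(chi-square >= x) for one degree of freedom."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))

p_five = chi2_tail_one_df(3.841)     # tabled 5 per cent value
p_one = chi2_tail_one_df(6.635)      # tabled 1 per cent value
p_present = chi2_tail_one_df(16.222) # value found for Table 145

print(round(p_five, 3), round(p_one, 3), p_present)
```

The last figure falls far below the conventional limits of .05 and .01.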

This test, it should be noted, does not define the relationship
between durability of goods and the timing of price
revival. It leads us to reject the hypothesis that durability
has no bearing on the sequence of price advances in revival.
If, on the basis of some other rational hypothesis, we could
obtain a set of expected frequencies representing a definite
relationship other than one of independence, this hypothesis
could be tested in the same manner. From the present
evidence, however, we may only conclude that the proportion
of durable goods preceding the general price index on revival
is smaller and the proportion of non-durable goods larger
than would be expected if durability had no relation to
the timing of price recovery after a business depression.

THE CHI-SQUARE TEST OF HOMOGENEITY

For each of eight major industrial groups we have records
showing, for the year 1933, the number of corporations
reporting net incomes from their operations and the number
reporting no net incomes (i.e., suffering deficits). The 
returns relate to a total of 492,649 corporations. Is this 
total a homogeneous whole, or does the division of corpora- 
tions between those earning net incomes and those suffering 
deficits vary significantly from group to group? The records 
appear in Table 147. 


Table 147

Comparison of Observed and Theoretical Frequencies
(Tabulations based on corporate income tax returns for 1933, by major
industrial groups ¹)

Column heads:
(1) Group
(2) Total number of returns
(3) Actual number of returns showing no net income ($f_o$)
(4) Theoretical (expected) number of returns showing no net income ($f$)
(5) $f_o - f$
(6) $(f_o - f)^2$
(7) $(f_o - f)^2 / f$

(1)                          (2)        (3)        (4)        (5)         (6)          (7)

Agriculture and related
  industries                10,490      7,818      7,150    +   668      446,224      62.4090
Mining and quarrying        17,147      8,866     11,688    - 2,822    7,963,684     681.3555
Manufacturing               93,833     62,295     63,958    - 1,663    2,765,569      43.2404
Construction                18,234     14,112     12,428    + 1,684    2,835,856     228.1828
Transportation and other
  public utilities          24,302     14,349     16,564    - 2,215    4,906,225     296.1980
Trade                      137,858     93,621     93,965    -   344      118,336       1.2594
Service                     47,843     35,419     32,610    + 2,809    7,890,481     241.9650
Finance                    142,942     99,314     97,431    + 1,883    3,545,689      36.3918

Total                      492,649    335,794    335,794                           1,591.0019
Per cent                   100.000     68.161

The observed frequencies are, of course, the actual returns
given in col. (3) of Table 147. A set of theoretical or
expected frequencies, for comparison with these, may be
set up on the assumption that all corporations in the United
States constituted a homogeneous population as regards
the likelihood of suffering a deficit in 1933. On this
assumption we may say that the probability of failing to
earn a net profit was, for all the elements of this assumed
homogeneous population, 335,794/492,649. If this is the
true probability for all elements of the population, we may
determine a theoretical frequency for each industrial group
by applying this ratio to the total number of corporations
in that group. On the assumption made we should find,
in all groups, the same proportionate division between
corporations earning net incomes and those suffering deficits,
except for modifications due to fluctuations of sampling.
The expected frequencies appear in column (4). If the
hypothesis of homogeneity is valid, these are the true
frequencies for the several groups. Differences between
these and the observed frequencies reflect the play of chance
alone.

¹ From Statistics of Income for 1933. U. S. Treasury Department,
Washington, D. C.

The calculation of $\chi^2$, measuring the degree of discrepancy
between the observed and theoretical frequencies, is shown
in cols. (5), (6), and (7) of Table 147. The value of $\chi^2$,
computed with 7 degrees of freedom, is 1,591.0019. Since
the 1 per cent value of $\chi^2$, for $n = 7$, is only 18.475, the
conclusion is clear that the discrepancy is too great to
be attributed to chance. The results are not consistent
with the hypothesis of homogeneity. We are not justified
in assuming that the forces affecting the profitability of
corporate operations in 1933 were the same, among the
eight major industrial groups here represented.
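The steps of Table 147 may be sketched as follows. The variable names are hypothetical. Two points are assumptions of the sketch: the common ratio is rounded to 68.161 per cent and the expected frequencies to whole numbers, as the printed table appears to do; and the observed figure for Construction is taken as 14,112, the value implied by the listed deviation of +1,684 and by the column total of 335,794.

```python
# (total returns, returns showing no net income) for each group, 1933.
groups = {
    "Agriculture and related industries": (10490, 7818),
    "Mining and quarrying": (17147, 8866),
    "Manufacturing": (93833, 62295),
    "Construction": (18234, 14112),
    "Transportation and other public utilities": (24302, 14349),
    "Trade": (137858, 93621),
    "Service": (47843, 35419),
    "Finance": (142942, 99314),
}

all_returns = sum(total for total, _ in groups.values())
all_deficits = sum(fo for _, fo in groups.values())

# Common probability of a deficit under the hypothesis of homogeneity.
ratio = round(all_deficits / all_returns, 5)

# Expected number of deficits in each group, to the nearest whole number.
expected = {name: round(total * ratio) for name, (total, _) in groups.items()}

chi2 = sum((fo - expected[name]) ** 2 / expected[name]
           for name, (_, fo) in groups.items())

degrees_of_freedom = len(groups) - 1
one_per_cent_value = 18.475  # Table 143, n = 7, P = .01
homogeneous = chi2 < one_per_cent_value

print(ratio, round(chi2, 4), degrees_of_freedom, homogeneous)
```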

The various procedures discussed in this chapter give 
some indication of the variety and power of the methods 
available for use in interpreting and appraising the results 
of statistical research. Each one involves, in some form, 
the testing of hypotheses against evidence yielded by the 
study of samples. It should be emphasized that the formal 
procedures described in the preceding pages are employed at 
a rather late stage in actual research work. The experiment 
will have been planned, the field work done, hypotheses 
framed before the tests here discussed can be applied.
These various steps must, of course, be coordinated. The 
data must be gathered with reference to the hypotheses 
to be tested and to the analytical methods to be employed. 
Acquaintance with appropriate techniques is one pre- 
requisite of intelligent planning of research in which quanti- 
tative data are utilized. Familiarity with the characteristics 
and limitations of the available materials, and clear definition 
of the questions at issue, are equally important elements. 
THE METHOD OF LEAST SQUARES AS APPLIED
TO CERTAIN STATISTICAL PROBLEMS 


The method of least squares in the case of a single 
unknown quantity is merely a procedure for obtaining the 
most probable value of that quantity from a number of 
separate observations. The most probable value is that 
for which the sum of the squares of the deviations (or 
residuals) is a minimum. This is the arithmetic mean of 
the observations. 
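That the arithmetic mean renders the sum of the squared residuals a minimum may be verified numerically. The observations below are invented for illustration, and the function name is hypothetical; the sketch merely compares the sum of squares about the mean with the sums about a number of neighboring values.

```python
def sum_of_squares(observations, value):
    """Sum of squared deviations of the observations from a trial value."""
    return sum((x - value) ** 2 for x in observations)

# Hypothetical observations on a single unknown quantity.
observations = [4.0, 7.0, 1.0, 8.0]
mean = sum(observations) / len(observations)

at_mean = sum_of_squares(observations, mean)

# Trial values on either side of the mean, in steps of 0.25.
trial_values = [mean + step * 0.25 for step in range(-8, 9) if step != 0]
elsewhere = [sum_of_squares(observations, v) for v in trial_values]

print(mean, at_mean, min(elsewhere))
```

Every trial value other than the mean gives a larger sum of squares.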


Where the measurements or observations do not relate 
directly to a single unknown quantity, but to functions of a 
number of unknown quantities, the problem is somewhat 
different. In the first case mentioned each observation is 
in the form of a single magnitude. In the present case 
each observation is in the form of an observation equation 
in which the observed values of the variables, as found in 
combination, are entered. The unknown quantities are 
the constants which define the functional relationship 
between the variables in question. Our problem is that 
of finding the most probable values of these constants, the
true values being unknown.

As in the simpler case the most probable values are
those for which the sum of the squares of the residuals
is a minimum. In this case, however, the residuals are
deviations, not from a single magnitude, as in the case
of the arithmetic mean, but from the curve which describes
the most probable functional relationship. The residuals
are the differences between the computed and the actual
values of the dependent variable.
DERIVATION OF THE NORMAL EQUATIONS

Representing by $Y$ an observed value of the dependent
variable, by $Y_c$ the corresponding computed value, by $v$ the
residual, or difference between $Y$ and $Y_c$, and by $W_1$, $W_2$,
$W_3$, and $W_4$ different independent variables (or different
functions of a single independent variable), we may write

$$Y_c = f(W_1, W_2, W_3, W_4)$$

$$v = Y_c - Y = f(W_1, W_2, W_3, W_4) - Y$$

$$\Sigma(v^2) = \Sigma[f(W_1, W_2, W_3, W_4) - Y]^2$$

If the function in a particular case is of the type

$$Y_c = aW_1 + bW_2 + cW_3 + dW_4$$

we have

$$\Sigma(v^2) = \Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2.$$

Our problem is that of determining the most probable
values of the constants that define the function. These
constants are represented, in the present case, by $a$, $b$, $c$,
and $d$. (The $W$'s, it should be noted, refer to quantities
which are known, once the observation equations are given.
In the usual case the $W$'s are different functions of a single
variable, but this is not essential.) On the assumption
that the errors of observation are distributed in accordance
with the normal law of error, it may be demonstrated
that the most probable values of $a$, $b$, $c$, and $d$, in the above
equation, are those which render $\Sigma(v^2)$ a minimum; i.e.,

$$\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = \text{a minimum.} \qquad (a)$$

The normal equations necessary for the solution may be
obtained by equating to zero the partial derivatives of
the above expression with respect to the unknowns, $a$, $b$,
$c$, and $d$. That is, we first differentiate the above function
with respect to $a$, holding $b$, $c$, and $d$ constant, then with
respect to $b$, holding $a$, $c$, and $d$ constant, then with respect
to $c$, holding $a$, $b$, and $d$ constant, then with respect to $d$,
holding $a$, $b$, and $c$ constant. Carrying through this operation
with respect to $a$, we have

$$\frac{\partial}{\partial a}\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\text{I} \qquad \Sigma W_1[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0.$$

Differentiating equation (a) now with respect to $b$, we have

$$\frac{\partial}{\partial b}\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\text{II} \qquad \Sigma W_2[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0.$$

Differentiating equation (a) with respect to $c$,

$$\frac{\partial}{\partial c}\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\text{III} \qquad \Sigma W_3[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0.$$

Differentiating equation (a) with respect to $d$,

$$\frac{\partial}{\partial d}\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\text{IV} \qquad \Sigma W_4[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0.$$

The most probable values of the quantities $a$, $b$, $c$, and
$d$ are secured by solving simultaneously the four normal
equations thus obtained (numbered above I, II, III, IV).

When the observation equations are all of the first degree 
(i.e., of the first degree with respect to the unknown quan- 
tities, a, b, c, etc.) the normal equations may be secured 
by the following process: 
1. Write the equation which describes the assumed relationship.
The observation equations are derived by substituting in this
equation the observed values of the variables, as found in
combination.

2. Multiply each observation equation by the coefficient of the 
first unknown in that equation; the sum of the resulting equations 
constitutes the first normal equation. 

3. Multiply each observation equation by the coefficient of the
second unknown in that equation; the sum of the resulting
equations constitutes the second normal equation.

Continue this process until normal equations equal in 
number to the unknown quantities are obtained. 

The actual process of forming the normal equations in
curve fitting may be simplified, and the writing out of the
separate observation equations avoided, as was demonstrated
in earlier sections. The following may be laid down as
general rules for the formation of the desired normal
equations:

1. Write the equation of the curve to be fitted. For the purpose
of this explanation we may employ the general form

$$Y = aW_1 + bW_2 + cW_3 + dW_4 + \ldots \qquad (1)$$

where $Y$ represents the dependent variable, $a$, $b$, $c$, $d$, . . . represent
the constants in the equation (the unknown quantities in the
present instance) and $W_1$, $W_2$, $W_3$, $W_4$, . . . represent the
coefficients of these unknowns. It is assumed that these coefficients
represent variables, and that term is used with reference to them.
Call this equation (1).

2. Multiply each term in equation (1) by the coefficient of the
first unknown in (1) (i.e., by $W_1$) and place the summation sign,
$\Sigma$, before each variable. This is the first normal equation (I).

3. Multiply each term in equation (1) by the coefficient of the
second unknown (i.e., by $W_2$) and place the summation sign before
each variable. This is the second normal equation (II).

4. Multiply each term in equation (1) by the coefficient of the
third unknown (i.e., by $W_3$) and place the summation sign before
each variable. This is the third normal equation (III).

5. Multiply each term in equation (1) by the coefficient of the
fourth unknown (i.e., by $W_4$) and place the summation sign before
each variable. This is the fourth normal equation (IV).
The process may be continued until normal equations
equal in number to the unknown quantities are obtained.^
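The rules above translate almost directly into code. The following Python sketch is one possible rendering, not a procedure given in the text: the function names `normal_equations` and `solve`, and the five exact data points generated from $Y = 1 + 2X + 0.5X^2$, are assumptions chosen so that the answer is known in advance.

```python
def normal_equations(rows, y):
    """Form the normal equations for Y = a*W1 + b*W2 + ...

    rows[i] holds the coefficients (W1, W2, ...) of observation i.
    Equation j is obtained by multiplying through by Wj and summing:
    A[j][k] = sum(Wj*Wk), rhs[j] = sum(Wj*Y).
    """
    k = len(rows[0])
    A = [[sum(r[j] * r[m] for r in rows) for m in range(k)] for j in range(k)]
    rhs = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(k)]
    return A, rhs

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [b] for row, b in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

# Hypothetical exact data from Y = 1 + 2X + 0.5X^2, so W1 = 1, W2 = X, W3 = X^2.
xs = [0, 1, 2, 3, 4]
ys = [1 + 2 * x + 0.5 * x * x for x in xs]
rows = [(1, x, x * x) for x in xs]

A, rhs = normal_equations(rows, ys)
a, b, c = solve(A, rhs)
print(a, b, c)
```

The matrix `A` is symmetric, and its first entry is simply the number of observations because $W_1 = 1$ throughout.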

A STANDARD SET OF NORMAL EQUATIONS 

As a set of generalized normal equations secured by the 
above process and applying to any equation which can be 
put in the form 

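$$Y = aW_1 + bW_2 + cW_3 + dW_4 + \ldots,$$

we have

$$\text{I} \qquad \Sigma(W_1Y) = a\Sigma(W_1^2) + b\Sigma(W_1W_2) + c\Sigma(W_1W_3) + d\Sigma(W_1W_4) + \ldots$$

$$\text{II} \qquad \Sigma(W_2Y) = a\Sigma(W_1W_2) + b\Sigma(W_2^2) + c\Sigma(W_2W_3) + d\Sigma(W_2W_4) + \ldots$$

$$\text{III} \qquad \Sigma(W_3Y) = a\Sigma(W_1W_3) + b\Sigma(W_2W_3) + c\Sigma(W_3^2) + d\Sigma(W_3W_4) + \ldots$$

$$\text{IV} \qquad \Sigma(W_4Y) = a\Sigma(W_1W_4) + b\Sigma(W_2W_4) + c\Sigma(W_3W_4) + d\Sigma(W_4^2) + \ldots$$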

By substituting for $W_1$, $W_2$, $W_3$, $W_4$, etc., the particular
functions employed in a given case, these equations may
be readily adapted to any type of curve in the fitting of
which the method of least squares is applicable. Thus in
fitting a curve represented by the equation

$$Y = a + bX + cX^2 + dX^3$$

substitutions in the standard normal equations given above
are based upon the following relations:

$$W_1 = 1$$

$$W_2 = X$$

$$W_3 = X^2$$

$$W_4 = X^3.$$

The changes to be made in the normal equations are
obvious. $\Sigma(W_1Y)$ becomes $\Sigma(Y)$; $\Sigma(W_1^2)$ is equivalent to
$\Sigma(1^2)$, which is equal to $N$, the total number of observations.

^ These rules represent an adaptation of a similar series formulated by 
Raymond FeBLilMM 
The first normal equation becomes

$$\Sigma(Y) = Na + b\Sigma(X) + c\Sigma(X^2) + d\Sigma(X^3).$$

The other normal equations are modified correspondingly. 

In the example just given, the coefficients are all different 
functions of a single independent variable, X. It is not, of 
course, essential to the method of least squares that this 
be so. The coefficients, $W_1$, $W_2$, $W_3$, etc., may represent a
number of independent variables, as in the case of multiple 
correlation. 

The limitations to the method of least squares must be 
borne in mind in making use of it. This method, in its 
direct application, is limited to cases in which the equation 
to the curve to be fitted is linear in the constants, i.e., the 
observation equations must all be linear as regards the 
unknown values, $a$, $b$, $c$, etc. (This does not mean, of course,
that the equation to the fitted curve must be linear.) As
an example of this limitation, we may cite a curve having
as equation $y = ab^x$, which cannot be fitted directly by
the method of least squares. If the observation equations 
are non-linear they may be reduced to the linear form in 
many instances by the use of logarithms, and the method 
of least squares then employed. 
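The reduction by logarithms can be sketched as follows for the curve $y = ab^x$, which becomes $\log y = \log a + x \log b$, a form linear in the unknowns $\log a$ and $\log b$. The data below are hypothetical (generated exactly from $a = 3$, $b = 1.5$) and the helper name `fit_line` is an assumption. Note that this procedure minimizes the squared residuals of $\log y$, not of $y$ itself, so on real data the constants will generally differ somewhat from those of a direct fit.

```python
import math

def fit_line(xs, ys):
    """Least-squares constants (intercept, slope) of a straight line."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Hypothetical data lying exactly on y = 3 * 1.5**x.
xs = [0, 1, 2, 3, 4]
ys = [3.0 * 1.5 ** x for x in xs]

# Fit log y = log a + x log b, then transform back.
log_a, log_b = fit_line(xs, [math.log10(y) for y in ys])
a, b = 10 ** log_a, 10 ** log_b
print(a, b)
```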

DERIVATION OF THE FORMULA FOR THE STANDARD ERROR
OF ESTIMATE

It has been pointed out in the body of the text that the 
standard error of estimate may be derived as a by-product 
of the method of least squares. A more complete demon- 
stration of this process may be given at this point. 

When the partial derivative of the expression

$$\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = \text{a minimum}$$

is equated to zero, with respect to the first unknown, $a$,
we have

$$\Sigma W_1[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0.$$

Since

$$aW_1 + bW_2 + cW_3 + dW_4 - Y = v,$$

we have as a necessary condition of fitting

$$\Sigma(vW_1) = 0.$$

When the partial derivative of the same expression with
respect to $b$ is equated to zero, we have

$$\Sigma W_2[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0$$

or, making the same substitution as in the preceding case,

$$\Sigma(vW_2) = 0.$$

Repeating the operation with respect to $c$ and $d$, we may
show that

$$\Sigma(vW_3) = 0$$

and

$$\Sigma(vW_4) = 0.$$

In summary: When the method of least squares is
employed in determining the most probable values of certain
unknown quantities, having as known coefficients the
quantities $W_1$, $W_2$, $W_3$, $W_4$, the following relations hold
as a necessary condition of the least squares method:

$$\Sigma(vW_1) = 0$$

$$\Sigma(vW_2) = 0$$

$$\Sigma(vW_3) = 0$$

$$\Sigma(vW_4) = 0.$$

A knowledge of these relationships gives us a method of
securing readily the value $\Sigma(v^2)$ and the standard error of
estimate. Assume that, by the method of least squares, we
have determined the constants in an equation of the type

$$Y_c = aW_1 + bW_2 + cW_3 + dW_4.$$

For each residual we have the relation

$$v = aW_1 + bW_2 + cW_3 + dW_4 - Y. \qquad (1)$$

Multiplying throughout by $v$, and summing, we have

$$\Sigma(v^2) = a\Sigma(vW_1) + b\Sigma(vW_2) + c\Sigma(vW_3) + d\Sigma(vW_4) - \Sigma(Yv). \qquad (2)$$

But

$$\Sigma(vW_1) = 0$$

$$\Sigma(vW_2) = 0$$

$$\Sigma(vW_3) = 0$$

$$\Sigma(vW_4) = 0$$

therefore,

$$\Sigma(v^2) = -\Sigma(Yv). \qquad (3)$$

Multiplying each equation (1) throughout by $Y$, and
adding, we have

$$\Sigma(Yv) = a\Sigma(W_1Y) + b\Sigma(W_2Y) + c\Sigma(W_3Y) + d\Sigma(W_4Y) - \Sigma(Y^2). \qquad (4)$$

Substituting in (3) the equivalent of $\Sigma(Yv)$, we have

$$\Sigma(v^2) = \Sigma(Y^2) - a\Sigma(W_1Y) - b\Sigma(W_2Y) - c\Sigma(W_3Y) - d\Sigma(W_4Y). \qquad (5)$$


This gives us a method of obtaining the value $\Sigma(v^2)$
without computing the separate residuals, a method that
is applicable whenever the equation of the curve to be
fitted is of the form, or may be reduced by the use of logarithms,
reciprocals, or other manipulation to the form

$$Y = aW_1 + bW_2 + cW_3 + dW_4.$$

In applying this to a particular case it is necessary only to
replace $W_1$, $W_2$, $W_3$, $W_4$, etc., by the functions that actually
appear as coefficients of the unknown quantities in the
original equation. Thus in fitting a curve the equation to
which is

$$Y = a + bX + cX^2 + dX^3,$$

we find, as noted above, that

$$W_1 = 1$$

$$W_2 = X$$

$$W_3 = X^2$$

$$W_4 = X^3.$$
Making these substitutions in equation (5) above, we have

$$\Sigma(v^2) = \Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^2Y) - d\Sigma(X^3Y). \qquad (6)$$

The standard error, $S_y$, is derived from the equation*

$$S_y^2 = \frac{\Sigma(d^2)}{N}$$

where $d$ is used to represent a deviation from a fitted curve.
The deviation, $d$, then, is but another term for the residual
$v$. Accordingly, as a general expression for the standard
error of $Y$, with $W_1$, $W_2$, $W_3$, and $W_4$ as independent
variables, we have

$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(W_1Y) - b\Sigma(W_2Y) - c\Sigma(W_3Y) - d\Sigma(W_4Y)}{N}. \qquad (7)$$

As in the previous case, this may be applied to a particular
problem by replacing $W_1$, $W_2$, $W_3$, $W_4$, etc., by the actual
coefficients of the unknown quantities.
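The relations $\Sigma(vW) = 0$ and the short formula for $\Sigma(v^2)$ can be checked numerically. The sketch below fits a straight line, $Y = a + bX$ (so that $W_1 = 1$ and $W_2 = X$), to the nine points used for the straight-line illustration later in this appendix, and compares the sum of squared residuals computed directly with the value given by the form of equation (5). The variable names are assumptions made for this illustration.

```python
# Nine (X, Y) points from the straight-line example later in this appendix.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [3, 4, 6, 5, 10, 9, 10, 12, 11]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xx = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_yy = sum(y * y for y in ys)

# Least-squares constants of Y = a + bX from the two normal equations.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# Residuals v = Yc - Y, computed one observation at a time.
v = [a + b * x - y for x, y in zip(xs, ys)]
sum_v = sum(v)                                  # sum(v * W1), should vanish
sum_vx = sum(r * x for r, x in zip(v, xs))      # sum(v * W2), should vanish
direct = sum(r * r for r in v)

# Short formula: sum(v^2) = sum(Y^2) - a*sum(W1*Y) - b*sum(W2*Y).
shortcut = sum_yy - a * sum_y - b * sum_xy

# Standard error of estimate, using N as the divisor.
std_error = (shortcut / n) ** 0.5
print(direct, shortcut, std_error)
```

Apart from rounding in the last decimal places, the two values of $\Sigma(v^2)$ coincide, so the residuals need not be computed separately.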

DERIVATION OF THE FORMULA FOR THE INDEX OF
CORRELATION


We have adopted as an index of the degree of correlation
between two variables the measure $\rho$ (rho), derived from
the equation

$$\rho_{yx}^2 = 1 - \frac{S_y^2}{\sigma_y^2} \qquad (8)$$

assuming a single dependent variable, $Y$, and a single independent
variable, $X$. With a single dependent variable, $Y$,
and a number of independent variables, $W_1$, $W_2$, $W_3$, $W_4$,
the expression might be written

$$\rho_{y \cdot w_1w_2w_3w_4}^2 = 1 - \frac{S_{y \cdot w_1w_2w_3w_4}^2}{\sigma_y^2}. \qquad (9)$$

* Since our object is to measure the actual "scatter" about the fitted curve,
the formula $\frac{\Sigma(d^2)}{N}$ is used, rather than the formula $\frac{\Sigma(d^2)}{N - N_c}$ (where $N$ represents
the number of observations and $N_c$ the number of constants in the equation
to the fitted curve). The second formula would be used, in accordance
with the theory of least squares, if we were seeking to determine the mean
square error of an observation or of an observational equation.
Corresponding changes would be made in the subscripts for
other changes in the symbols employed. The expression
above is equivalent to

$$\rho^2 = 1 - \frac{\Sigma(d^2)/N}{\Sigma(y^2)/N}$$

where $y$ represents a deviation from an origin at the mean
of the $Y$'s. But

$$\frac{\Sigma(y^2)}{N} = \frac{\Sigma(Y^2)}{N} - c_y^2$$

where $Y$ represents the original values of the $Y$-variable
and $c_y$ represents the difference between the original origin
and the mean of the $Y$'s. (The symbols $c_y$ and $c_y^2$ should
not be confused with $c$, one of the constants in the equation
to the fitted curve.)

Accordingly, we have

$$\rho_{y \cdot w_1w_2w_3w_4}^2 = 1 - \frac{\Sigma(d^2)}{\Sigma(Y^2) - Nc_y^2}. \qquad (10)$$

But we have secured an expression for $\Sigma(v^2)$ [the equivalent
of $\Sigma(d^2)$] which holds in the case of a curve fitted by the
method of least squares. Substituting the equivalent of
$\Sigma(d^2)$ in the above equation, and simplifying, we have,
as a general formula for the index of correlation

$$\rho_{y \cdot w_1w_2w_3w_4 \ldots}^2 = \frac{a\Sigma(W_1Y) + b\Sigma(W_2Y) + c\Sigma(W_3Y) + d\Sigma(W_4Y) + \ldots - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}. \qquad (11)$$


This may be applied to a specific case by replacing $W_1$,
$W_2$, $W_3$, $W_4$, etc., in the above formula by the functions
which appear as coefficients of the unknown quantities in
the original equation. When all these are functions of a
single independent variable, as in the usual case, the index
of correlation would be represented by the symbol $\rho_{yx}$.


CERTAIN SPECIAL CASES

In the case of multiple correlation, where the symbols
$X_1$, $X_2$, $X_3$, $X_4$, etc., are used to represent all the variables,
whether considered dependent or independent, the symbol
$R$ is employed for the measure of correlation and numerical
subscripts utilized as described in the body of the text.

In the case of a straight line relationship between two
variables, $\rho$ is replaced by the symbol $r$, which represents
the ordinary coefficient of correlation. As the general
equation for $r$ we have

$$r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}.$$

There are two special cases in which this formula may be
simplified. If the origin be at the mean of the $X$'s, we have

$$a\Sigma(Y) = Nc_y^2$$

and the formula for $r$ reduces to

$$r^2 = \frac{b\Sigma(xY)}{\Sigma(Y^2) - Nc_y^2}.$$

If the origin be at the mean of the $Y$'s (it is not essential
that it be also at the mean of the $X$'s)

$$\Sigma(y) = 0, \text{ and } c_y = 0$$

and the formula for the coefficient of correlation becomes

$$r^2 = \frac{b\Sigma(Xy)}{\Sigma(y^2)}.$$

In this latter case the general formula for $\rho$ may also be
simplified by the elimination of the terms $a\Sigma(y)$ and $Nc_y^2$.


CHECKS ON THE FORMATION OF THE NORMAL EQUATIONS

There are so many possibilities of arithmetical error in
the formation and solution of a set of normal equations
that checks should be employed wherever possible. A
convenient check on the calculations leading to the normal
equations is afforded by the introduction in each observation
equation of an additional term, $s$, equal to the sum of all
the known quantities in that equation. Thus, in the following
system of observation equations, formed in fitting a line
to the points 1, 3; 2, 4; 3, 6; 4, 5; 5, 10; 6, 9; 7, 10; 8, 12;
9, 11, the values of $s$ are as indicated:

                      s

     3 = a + 1b       5
     4 = a + 2b       7
     6 = a + 3b      10
     5 = a + 4b      10
    10 = a + 5b      16
     9 = a + 6b      16
    10 = a + 7b      18
    12 = a + 8b      21
    11 = a + 9b      21.

(The coefficient of $a$ in each case is 1, and this is added to
the other known quantities.)

In fitting a curve described by the type equation

$$Y = aW_1 + bW_2 + cW_3 + dW_4$$

the following relations prevail between $s$ and the other
quantities computed. For each observation equation,

$$Y + W_1 + W_2 + W_3 + W_4 = s.$$

For the normal equations,

$$\Sigma(W_1Y) + \Sigma(W_1^2) + \Sigma(W_1W_2) + \Sigma(W_1W_3) + \Sigma(W_1W_4) = \Sigma(W_1s)$$

$$\Sigma(W_2Y) + \Sigma(W_1W_2) + \Sigma(W_2^2) + \Sigma(W_2W_3) + \Sigma(W_2W_4) = \Sigma(W_2s)$$

$$\Sigma(W_3Y) + \Sigma(W_1W_3) + \Sigma(W_2W_3) + \Sigma(W_3^2) + \Sigma(W_3W_4) = \Sigma(W_3s)$$

$$\Sigma(W_4Y) + \Sigma(W_1W_4) + \Sigma(W_2W_4) + \Sigma(W_3W_4) + \Sigma(W_4^2) = \Sigma(W_4s)$$

This form is capable of application to any specific problem. 
In each case the s-equations are formed in precisely the 
same way as the corresponding normal equations. 

In applying these checks several additional columns are 
needed in the working tables, but the extra trouble is 
more than compensated by the opportunity to check the 
work at each stage. The application is illustrated in the
following working table, showing the calculations involved
in fitting a second degree curve of the form

$$Y = a + bX + cX^2$$

to the nine points 1, 2; 2, 6; 3, 7; 4, 8; 5, 10; 6, 11; 7, 11;
8, 10; 9, 9.

Table A

Illustrating the Use of Checks on the Formation of Normal Equations

     Y      X     X^2     XY    X^2Y      s      Xs     X^2s

     2      1       1      2       2      5       5        5
     6      2       4     12      24     13      26       52
     7      3       9     21      63     20      60      180
     8      4      16     32     128     29     116      464
    10      5      25     50     250     41     205    1,025
    11      6      36     66     396     54     324    1,944
    11      7      49     77     539     68     476    3,332
    10      8      64     80     640     83     664    5,312
     9      9      81     81     729    100     900    8,100

    74     45     285    421   2,771    413   2,776   20,414

(Columns for $X^3$ and $X^4$ are omitted, as the values $\Sigma(X^3)$ and $\Sigma(X^4)$ may be
derived from prepared tables.)


Each of the values in the column headed $s$ is secured
from the corresponding observation equation. Thus, from
the first observation equation

$$2 = 1a + 1b + 1c,$$

we have 5 as the value of $s$ (2, plus the coefficients of the
three constants). These values of $s$ are secured readily
from the table by adding the figures in the columns headed
$Y$, $X$, and $X^2$, plus 1, the coefficient of the constant term $a$.

Adding the various columns, the arithmetic work is
verified by the following checks:

$$\Sigma(Y) + N + \Sigma(X) + \Sigma(X^2) = \Sigma(s)$$

$$74 + 9 + 45 + 285 = 413$$

$$\Sigma(XY) + \Sigma(X) + \Sigma(X^2) + \Sigma(X^3) = \Sigma(Xs)$$

$$421 + 45 + 285 + 2{,}025 = 2{,}776$$

$$\Sigma(X^2Y) + \Sigma(X^2) + \Sigma(X^3) + \Sigma(X^4) = \Sigma(X^2s)$$

$$2{,}771 + 285 + 2{,}025 + 15{,}333 = 20{,}414.$$

Further uses of a check of this kind are explained below, 
in discussing the solution of the normal equations. 
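The three column checks can be reproduced mechanically. The sketch below recomputes them from the nine points of Table A; the helper name `power_sum` and the layout of the loop are assumptions made for this illustration.

```python
# The nine (X, Y) points of Table A.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [2, 6, 7, 8, 10, 11, 11, 10, 9]

# s is the sum of all known quantities in each observation equation:
# Y, the coefficient 1 of a, X, and X^2.
s = [y + 1 + x + x ** 2 for x, y in zip(xs, ys)]

def power_sum(p, values=None):
    """Sum of X**p, or of X**p times the given values, over all observations."""
    if values is None:
        return sum(x ** p for x in xs)
    return sum(x ** p * val for x, val in zip(xs, values))

# For p = 0, 1, 2 the check reads
# sum(X^p Y) + sum(X^p) + sum(X^(p+1)) + sum(X^(p+2)) = sum(X^p s).
checks = []
for p in range(3):
    left = power_sum(p, ys) + power_sum(p) + power_sum(p + 1) + power_sum(p + 2)
    right = power_sum(p, s)
    checks.append((left, right))
    print(p, left, right)
```

Each pair printed should agree; a disagreement points to an error in one of the columns entering that line.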

OTHER TESTS 

The possibility of checking the calculations in other ways
has been suggested in the preceding sections. Thus, where
the coefficients of the constants in the equation to the
fitted curve are represented by $W_1$, $W_2$, $W_3$, $W_4$, we know
that

$$\Sigma(vW_1) = 0$$

$$\Sigma(vW_2) = 0$$

$$\Sigma(vW_3) = 0$$

$$\Sigma(vW_4) = 0.$$

If a curve of the type

$$Y = a + bX + cX^2 + dX^3$$

has been fitted, this means that

$$\Sigma(v) = 0$$

$$\Sigma(vX) = 0$$

$$\Sigma(vX^2) = 0$$

$$\Sigma(vX^3) = 0.$$

The accuracy of the work may be tested by checking these
relations.

Finally, we may test the accuracy of the work by computing
the standard error of estimate in two different ways.
We may compute the separate residuals by taking the
difference between computed and actual values of the
dependent variable, and from these values determine $S_y$.
This may be compared with the results secured by applying
the general formula for the standard error, as derived above.
In the fitting of the second degree curve, the data of which
were used to illustrate the method of checking the normal
equations, the equation derived was

$$Y = -.92860 + 3.52316X - .267316X^2.$$

From the residuals separately computed, we have

$$S_y = .4941.$$

From the formula

$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^2Y)}{N}$$

we have

$$S_y = .4947.$$

This constitutes a final check upon the accuracy of the
calculations.
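The same comparison can be carried out in a few lines of code. The sketch below refits the second degree curve to the nine points of Table A and computes $S_y$ by both routes. The function names `solve` and `moment` are assumptions, and the elimination routine is a generic one rather than the tabular method of the text. Carried at full machine precision, both routes give a value close to .494; the small discrepancy between the two figures quoted above presumably reflects rounding in the hand computation.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [b] for row, b in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [2, 6, 7, 8, 10, 11, 11, 10, 9]
n = len(xs)

def moment(p, q=0):
    """Sum of X**p * Y**q over the nine points."""
    return sum(x ** p * y ** q for x, y in zip(xs, ys))

# Normal equations for Y = a + bX + cX^2 (W1 = 1, W2 = X, W3 = X^2).
A = [[moment(0), moment(1), moment(2)],
     [moment(1), moment(2), moment(3)],
     [moment(2), moment(3), moment(4)]]
rhs = [moment(0, 1), moment(1, 1), moment(2, 1)]
a, b, c = solve(A, rhs)

# Route 1: residuals computed separately.
residuals = [a + b * x + c * x * x - y for x, y in zip(xs, ys)]
sy_direct = (sum(r * r for r in residuals) / n) ** 0.5

# Route 2: the general formula, without computing residuals.
sy_formula = ((moment(0, 2) - a * moment(0, 1) - b * moment(1, 1)
               - c * moment(2, 1)) / n) ** 0.5

print(a, b, c)
print(sy_direct, sy_formula)
```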

SIMPLIFICATION OF NORMAL EQUATIONS IN A MULTIPLE
CORRELATION PROBLEM*

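In the discussion of multiple correlation procedure in
Chapter XVI the normal equations as first derived in the
form

$$\text{I} \qquad \Sigma(X_1) = Na + b_{12.34}\Sigma(X_2) + b_{13.24}\Sigma(X_3) + b_{14.23}\Sigma(X_4)$$

$$\text{II} \qquad \Sigma(X_1X_2) = a\Sigma(X_2) + b_{12.34}\Sigma(X_2^2) + b_{13.24}\Sigma(X_2X_3) + b_{14.23}\Sigma(X_2X_4)$$

$$\text{III} \qquad \Sigma(X_1X_3) = a\Sigma(X_3) + b_{12.34}\Sigma(X_2X_3) + b_{13.24}\Sigma(X_3^2) + b_{14.23}\Sigma(X_3X_4)$$

$$\text{IV} \qquad \Sigma(X_1X_4) = a\Sigma(X_4) + b_{12.34}\Sigma(X_2X_4) + b_{13.24}\Sigma(X_3X_4) + b_{14.23}\Sigma(X_4^2)$$

were reduced in number and modified to facilitate their
solution. Details of the method are here given.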

Letting $A_1$, $A_2$, $A_3$, and $A_4$ represent the arithmetic
means of the several variables, and $x_1$, $x_2$, $x_3$, and $x_4$ represent
deviations from the means, we may replace the variables
* Adapted from H. R. Tolley and M. J. B. Ezekiel, "A Method of Handling
Multiple Correlation Problems," Journal of the American Statistical
Association, Vol. 18, 993-1003.
$X_1$, $X_2$, $X_3$, and $X_4$ by their equivalents $x_1 + A_1$, $x_2 + A_2$,
$x_3 + A_3$, $x_4 + A_4$. The normal equations now become:

$$\text{I} \qquad \Sigma(x_1 + A_1) = Na + \Sigma(x_2 + A_2) \cdot b_{12.34} + \Sigma(x_3 + A_3) \cdot b_{13.24} + \Sigma(x_4 + A_4) \cdot b_{14.23}$$

$$\text{II} \qquad \Sigma[(x_1 + A_1)(x_2 + A_2)] = \Sigma(x_2 + A_2) \cdot a + \Sigma(x_2 + A_2)^2 \cdot b_{12.34} + \Sigma[(x_2 + A_2)(x_3 + A_3)] \cdot b_{13.24} + \Sigma[(x_2 + A_2)(x_4 + A_4)] \cdot b_{14.23}$$

$$\text{III} \qquad \Sigma[(x_1 + A_1)(x_3 + A_3)] = \Sigma(x_3 + A_3) \cdot a + \Sigma[(x_3 + A_3)(x_2 + A_2)] \cdot b_{12.34} + \Sigma(x_3 + A_3)^2 \cdot b_{13.24} + \Sigma[(x_3 + A_3)(x_4 + A_4)] \cdot b_{14.23}$$

$$\text{IV} \qquad \Sigma[(x_1 + A_1)(x_4 + A_4)] = \Sigma(x_4 + A_4) \cdot a + \Sigma[(x_4 + A_4)(x_2 + A_2)] \cdot b_{12.34} + \Sigma[(x_4 + A_4)(x_3 + A_3)] \cdot b_{13.24} + \Sigma(x_4 + A_4)^2 \cdot b_{14.23}.$$

Since $\Sigma(x_1 + A_1) = \Sigma x_1 + NA_1$, and since $\Sigma x_1 = 0$, $\Sigma(x_1 + A_1)$
and all similar expressions may be replaced by $NA_1$, $NA_2$, etc.

If we expand $\Sigma(x_2 + A_2)^2$ to $\Sigma(x_2^2 + 2A_2x_2 + A_2^2)$, the
middle term drops out, because $\Sigma x_2 = 0$, and the expression
may be written $\Sigma x_2^2 + NA_2^2$. The sums of all similar
squares may be put in similar form.

The product sum $\Sigma(x_1 + A_1)(x_2 + A_2) = \Sigma(x_1x_2 + A_1x_2
+ A_2x_1 + A_1A_2) = \Sigma x_1x_2 + NA_1A_2$ since $\Sigma x_1 = 0$ and $\Sigma x_2
= 0$. Product sums of the same type may be similarly
modified. The normal equations now take the form:

$$\text{I} \qquad NA_1 = Na + NA_2b_{12.34} + NA_3b_{13.24} + NA_4b_{14.23}$$

$$\text{II} \qquad \Sigma(x_1x_2) + NA_1A_2 = NA_2a + [\Sigma(x_2^2) + NA_2^2]b_{12.34} + [\Sigma(x_2x_3) + NA_2A_3]b_{13.24} + [\Sigma(x_2x_4) + NA_2A_4]b_{14.23}$$

$$\text{III} \qquad \Sigma(x_1x_3) + NA_1A_3 = NA_3a + [\Sigma(x_2x_3) + NA_2A_3]b_{12.34} + [\Sigma(x_3^2) + NA_3^2]b_{13.24} + [\Sigma(x_3x_4) + NA_3A_4]b_{14.23}$$

$$\text{IV} \qquad \Sigma(x_1x_4) + NA_1A_4 = NA_4a + [\Sigma(x_2x_4) + NA_2A_4]b_{12.34} + [\Sigma(x_3x_4) + NA_3A_4]b_{13.24} + [\Sigma(x_4^2) + NA_4^2]b_{14.23}.$$

If we now divide through by $N$, and substitute $p_{12}$ for
$\Sigma(x_1x_2)/N$, $\sigma_2^2$ for $\Sigma(x_2^2)/N$, and similar symbols for other mean
products and mean squares, the normal equations become
$$\text{I} \qquad A_1 = a + A_2b_{12.34} + A_3b_{13.24} + A_4b_{14.23}$$

$$\text{II} \qquad p_{12} + A_1A_2 = A_2a + (\sigma_2^2 + A_2^2)b_{12.34} + (p_{23} + A_2A_3)b_{13.24} + (p_{24} + A_2A_4)b_{14.23}$$

$$\text{III} \qquad p_{13} + A_1A_3 = A_3a + (p_{23} + A_2A_3)b_{12.34} + (\sigma_3^2 + A_3^2)b_{13.24} + (p_{34} + A_3A_4)b_{14.23}$$

$$\text{IV} \qquad p_{14} + A_1A_4 = A_4a + (p_{24} + A_2A_4)b_{12.34} + (p_{34} + A_3A_4)b_{13.24} + (\sigma_4^2 + A_4^2)b_{14.23}.$$

These four simultaneous equations may now be reduced
to three. We multiply equation I, throughout, by $A_2$,
and subtract the result from equation II; we then multiply
equation I by $A_3$, and subtract the result from equation III;
we then multiply equation I by $A_4$, and subtract the result
from equation IV. All the terms containing $A$'s are thus
eliminated and we obtain the three normal equations

$$p_{12} = \sigma_2^2b_{12.34} + p_{23}b_{13.24} + p_{24}b_{14.23}$$

$$p_{13} = p_{23}b_{12.34} + \sigma_3^2b_{13.24} + p_{34}b_{14.23}$$

$$p_{14} = p_{24}b_{12.34} + p_{34}b_{13.24} + \sigma_4^2b_{14.23}.$$

Inserting the observed values of the $p$'s and the $\sigma$'s, these
are solved for the coefficients $b$. The value $a$ may then be
obtained by inserting the values of the $A$'s and the $b$'s
in the equation

$$A_1 = a + A_2b_{12.34} + A_3b_{13.24} + A_4b_{14.23}.$$
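That the reduced set of three equations yields the same constants as the original four can be verified numerically. The sketch below does so on a small set of made-up observations (six values each of a dependent variable and three independent variables); the data and the function names `solve`, `fit_full`, and `fit_reduced` are assumptions for illustration only.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [b] for row, b in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def fit_full(y, xs):
    """Solve the full normal equations I-IV, constant term included."""
    cols = [[1.0] * len(y)] + xs
    A = [[dot(ci, cj) for cj in cols] for ci in cols]
    return solve(A, [dot(ci, y) for ci in cols])

def fit_reduced(y, xs):
    """Solve the reduced equations in mean products, then recover a."""
    n = len(y)
    mean_y = sum(y) / n
    means = [sum(c) / n for c in xs]
    dy = [v - mean_y for v in y]
    dx = [[v - m for v in c] for c, m in zip(xs, means)]
    # Mean products p_jk; the diagonal entries are the mean squares sigma_j^2.
    P = [[dot(ci, cj) / n for cj in dx] for ci in dx]
    b = solve(P, [dot(ci, dy) / n for ci in dx])
    a = mean_y - dot(means, b)   # from A1 = a + A2*b12.34 + A3*b13.24 + ...
    return [a] + b

# Hypothetical observations: X1 is the dependent variable.
x1 = [5.0, 7.0, 10.0, 12.0, 16.0, 19.0]
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x3 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
x4 = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0]

full = fit_full(x1, [x2, x3, x4])
reduced = fit_reduced(x1, [x2, x3, x4])
print(full)
print(reduced)
```

The reduced route solves a system one order smaller, which is the practical gain the text describes for hand computation.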

SOLUTION OF THE NORMAL EQUATIONS 

The task of solving the normal equations is not a difficult 
one in most of the cases presented to the economic statis- 
tician. If there are only two or three unknowns the corre- 
sponding number of normal equations may be solved by 
simple algebraic methods. Even with three equations, 
however, it is advisable to employ a systematic procedure, 
and with more than three equations this is imperative. 
Such systematic methods of solving the simultaneous equa- 
tions which are met with in connection with the method 
of least squares have been worked out by Gauss and by 
Doolittle. The latter method, which is perhaps the more
convenient for general usage, is demonstrated below.

The coefficients of the unknowns in the normal equations
are always symmetrical with respect to the principal diagonal.
Thus in securing the most probable values of the
constants in the equation

$$Y = aW_1 + bW_2 + cW_3 + dW_4$$

we have the four normal equations

$$a\Sigma(W_1^2) + b\Sigma(W_1W_2) + c\Sigma(W_1W_3) + d\Sigma(W_1W_4) - \Sigma(W_1Y) = 0$$

$$a\Sigma(W_1W_2) + b\Sigma(W_2^2) + c\Sigma(W_2W_3) + d\Sigma(W_2W_4) - \Sigma(W_2Y) = 0$$

$$a\Sigma(W_1W_3) + b\Sigma(W_2W_3) + c\Sigma(W_3^2) + d\Sigma(W_3W_4) - \Sigma(W_3Y) = 0$$

$$a\Sigma(W_1W_4) + b\Sigma(W_2W_4) + c\Sigma(W_3W_4) + d\Sigma(W_4^2) - \Sigma(W_4Y) = 0$$

The symmetrical arrangement about the diagonal, when
$Y$-terms are neglected, is obvious. Starting with any term
on the principal diagonal, we have the same coefficients
directly above as to the left. Thus, above the diagonal
term in which the coefficient $\Sigma(W_3^2)$ appears, we have the
coefficients $\Sigma(W_2W_3)$ and $\Sigma(W_1W_3)$. The same coefficients
are found to the left of the given diagonal term, and on
the same line. For the purposes of solution, therefore,
the terms to the left of each diagonal entry may be omitted,
and we may put the remaining terms of the normal equations
in the form

$$\begin{array}{l}
a\Sigma(W_1^2) + b\Sigma(W_1W_2) + c\Sigma(W_1W_3) + d\Sigma(W_1W_4) - \Sigma(W_1Y) \\
\phantom{a\Sigma(W_1^2)} + b\Sigma(W_2^2) + c\Sigma(W_2W_3) + d\Sigma(W_2W_4) - \Sigma(W_2Y) \\
\phantom{a\Sigma(W_1^2) + b\Sigma(W_1W_2)} + c\Sigma(W_3^2) + d\Sigma(W_3W_4) - \Sigma(W_3Y) \\
\phantom{a\Sigma(W_1^2) + b\Sigma(W_1W_2) + c\Sigma(W_1W_3)} + d\Sigma(W_4^2) - \Sigma(W_4Y).
\end{array}$$

THE DOOLITTLE METHOD 

The Doolittle method may be illustrated with reference 
to the following normal equations: 

$$8.3564a + 2.790b + 2.932c + 47.967 = 0$$

$$2.790a + 6.6645b + 2.063c + 62.039 = 0$$

$$2.932a + 2.063b + 7.7893c + 47.519 = 0.$$

Putting these, for the purposes of the solution, in the
abbreviated form given above, we have

$$\begin{array}{l}
8.3564a + 2.790b + 2.932c + 47.967 \\
\phantom{8.3564a} + 6.6645b + 2.063c + 62.039 \\
\phantom{8.3564a + 2.790b} + 7.7893c + 47.519.
\end{array}$$

We wish to solve these for the constants a, 6, and c. All 
the work of computation, with the necessary checks, is 
shown in the following table : 


Table B

Solution of Normal Equations by the Doolittle Method

           (1)            (2)           (3)           (4)            (5)            (6)
Line   Reciprocals         a             b             c                             s

I                        8.3564        2.790         2.932        47.967         62.0454
II                                     6.6645        2.063        62.039         73.5565
III                                                  7.7893       47.519         60.3033

1                        8.35640       2.790         2.932        47.967         62.0454
2      - .11966876     - 1.00000     - .333876     - .350869    - 5.740151     - 7.424896   check

3                                      6.6645        2.063        62.039         73.5565
4                                    - .931514     - .978924   - 16.015030    - 20.715470
5                                      5.732986      1.084076    46.023970      52.841030   check
6      - .17442917                   - 1.000000    - .189094    - 8.027923     - 9.217017   check

7                                                    7.7893       47.519         60.3033
8                                                  - 1.028748  - 16.830133    - 21.769807
9                                                  - .204992    - 8.702857     - 9.991922
10                                                   6.555560    21.986010      28.541571   check
11     - .15254227                                 - 1.000000   - 3.353796     - 4.353796   check

Back Solution

        c                b                a

   - 3.353796       - 8.027923       - 5.740151
                    + .634183        + 2.468592
                                     + 1.176743
   - 3.353796       - 7.393740       - 2.094816

a = - 2.094816
b = - 7.393740
c = - 3.353796

Check:

Equation I:

$$8.3564a + 2.790b + 2.932c = -47.967.$$

Substituting the given values,

$$8.3564(-2.094816) + 2.790(-7.393740) + 2.932(-3.353796) = -47.966985.$$
Explanation. The coefficients of the unknown quantities,
a, b, and c, are listed in the designated columns.
The known term in each normal equation is listed in column
(5). (The sign of this known term, it should be noted,
is that which it would have when the entire expression, of
which it is one term, is equated to zero.) Column s is
employed as a check. The value in column s, in each of
the lines I, II, and III, is the algebraic sum of the known
values in the given normal equation. In securing this
sum the coefficients to the left of the diagonal, which have
been omitted from the table as it stands, must be included.

The following is a summary of the procedure in solving
the normal equations:

1. In line (1) write normal equation I.

2. In line (2), column (1), write the reciprocal of the value in
line (1), column (2), with sign changed. (This is the reciprocal of the
coefficient of a.) Multiply each item in line (1) by this reciprocal,
entering the products in the corresponding columns, in line (2).
[The algebraic sum of the items in columns (2), (3), (4), and (5) of
line (2) should equal the value in column (6).] This operation has
eliminated the unknown a, by expressing it in terms of b and c.
[The - 1 in line (2), column (2), has been included only to facilitate
the checking process. The same is true in lines (6) and (11).]
A heavy line may be drawn across the table below line (2).

3. Write normal equation II in line (3).

4. Multiply by the coefficient of b in line (2) (i.e., - .333876)
the items in columns (3), (4), (5), and (6) in line (1). Enter the
products in the corresponding columns of line (4).

5. Add lines (3) and (4), entering the sums in line (5). [The
algebraic sum of the items in columns (3), (4), and (5) of line
(5) should equal the value in column (6).]

6. In column (1), line (6), enter the reciprocal of the value in
column (3), line (5), reversing the sign. Multiply each term in line
(5) by this reciprocal, entering the products in line (6). [The sum
of the items in columns (3), (4), and (5) of line (6) should equal the
value in column (6).] This operation has eliminated the unknown b,
by expressing it in terms of c. A heavy line may be drawn across
the table below line (6).

7. Write normal equation III in line (7).

8. Multiply by the coefficient of c in line (2) (i.e., - .350869)
the items in columns (4), (5), and (6) of line (1). Enter the products
in the corresponding columns of line (8).

9. Multiply by the coefficient of c in line (6) (i.e., - .189094)
the items in columns (4), (5), and (6) of line (5). Enter the products
in the corresponding columns of line (9).

10. Add lines (7), (8), and (9), entering the sums in line (10).
[The algebraic sum of the items in columns (4) and (5) of line
(10) should equal the value in column (6).]

11. In column (1), line (11), enter the reciprocal of the value in
column (4) of line (10), reversing the sign. Multiply each term in
line (10) by this reciprocal, entering the products in line (11).
[The algebraic sum of the items in columns (4) and (5) of line
(11) should equal the value in column (6).] This operation gives
the value of c, which is found in column (5) of line (11). A heavy
line may be drawn across the table below line (11).

Were there additional unknowns, as d and e, this last 
operation would have given c as a function of d and e and 
it would be necessary to carry the process still further, 
repeating the steps taken above. The next operation would 
be to bring down the fourth normal equation, entering it 
in line (12). Then the coefficients of d in lines (2), (6), and 
(11) would be used to multiply the necessary items in 
lines (1), (5), and (10), the products being entered in lines 
(13), (14), and (15). The sum of the items in lines (12), 
(13), (14), and (15) would be entered in line (16) and
checked by the item in the s column. Multiplying through 
by the reciprocal of the coefficient of d in line (16), with 
sign reversed, the value of d would be obtained in terms 
of e. The value of e would be derived in a similar fashion. 

The checks on these various operations have been indi- 
cated in the table. The testing of the results at each step 
reduces the possibility of error to a minimum. 

The back solution presents no difficulties. We have,
from line (11),

$$c = -3.353796;$$

from line (6)

$$b = -.189094c - 8.027923;$$

from line (2)

$$a = -.333876b - .350869c - 5.740151.$$

[The items in column (6) are inserted merely as checks.
The items - 1.000000 which appear in lines (2), (6), and
(11) are inserted to assist in the checking.]

The computatior^ involved in the back solution appear 
in the table. 

A final check is afforded by inserting the values secured 
by this process in one of the normal equations. This check, 
as carried out for equation I, is shown below the table. 
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APPENDIX B 


DERIVATION OF FORMULAS FOR MEAN AND
STANDARD DEVIATION OF THE BINOMIAL
DISTRIBUTION ¹


For convenience we put the binomial in the form $(q + p)^n$
where q = probability of a failure, p = probability of a
success, and q + p = 1. Expanding the binomial, we have

$$(q + p)^n = q^n + nq^{n-1}p + \frac{n(n-1)}{1 \cdot 2} q^{n-2}p^2 + \frac{n(n-1)(n-2)}{1 \cdot 2 \cdot 3} q^{n-3}p^3 + \dots + p^n.$$


The terms of this expansion indicate, in order, the probable 
frequencies of no successes, 1 success, 2 successes, 3 suc- 
cesses, and so on, to n successes. A frequency table of the 
familiar type may be constructed from these materials. 

The items in col. (2) of Table C constitute the terms of
the binomial expansion. Their sum is thus equal to $(q + p)^n$,
which is, by definition, equal to 1. The items in col. (3),
added in order, give

$$nq^{n-1}p + \frac{2n(n-1)}{1 \cdot 2} q^{n-2}p^2 + \frac{3n(n-1)(n-2)}{1 \cdot 2 \cdot 3} q^{n-3}p^3 + \dots + np^n.$$

Since the factors n and p appear in each of these terms, this
reduces to

$$np\left[ q^{n-1} + (n-1)q^{n-2}p + \frac{(n-1)(n-2)}{1 \cdot 2} q^{n-3}p^2 + \dots + p^{n-1} \right].$$

¹ These derivations are adapted from the proof given by D. C. Jones in
A First Course in Statistics, London, Bell & Sons, 1921, 143-145.

But the terms within brackets, following np, represent the
expansion of the binomial $(q + p)^{n-1}$. Since q + p = 1,
the sum of these terms is 1. Accordingly the sum of the
items in col. (3) reduces to

$$np(q + p)^{n-1} = np.$$

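For the mean of this distribution we have

$$M = \frac{np}{\Sigma(f)} = \frac{np}{1} = np.$$

Adding the items in col. (4) in order, we have

$$nq^{n-1}p + 2n(n-1)q^{n-2}p^2 + \frac{3n(n-1)(n-2)}{1 \cdot 2} q^{n-3}p^3 + \frac{4n(n-1)(n-2)(n-3)}{1 \cdot 2 \cdot 3} q^{n-4}p^4 + \dots + n^2p^n$$

$$= np\left[ q^{n-1} + 2(n-1)q^{n-2}p + \frac{3(n-1)(n-2)}{1 \cdot 2} q^{n-3}p^2 + \frac{4(n-1)(n-2)(n-3)}{1 \cdot 2 \cdot 3} q^{n-4}p^3 + \dots + np^{n-1} \right].$$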

The terms within brackets may be broken into two groups,
giving

$$\begin{aligned} np\Big[ &\Big\{ q^{n-1} + (n-1)q^{n-2}p + \frac{(n-1)(n-2)}{1 \cdot 2} q^{n-3}p^2 + \dots + p^{n-1} \Big\} \\ &+ \Big\{ (n-1)q^{n-2}p + \frac{2(n-1)(n-2)}{1 \cdot 2} q^{n-3}p^2 + \frac{3(n-1)(n-2)(n-3)}{1 \cdot 2 \cdot 3} q^{n-4}p^3 + \dots + (n-1)p^{n-1} \Big\} \Big]. \end{aligned}$$

The terms within the first of these two groups constitute
the expansion of the binomial $(q + p)^{n-1}$. These terms
may be replaced by that binomial; the second group of
terms may be simplified, since they contain the common
factors n - 1 and p. These operations give us

$$np\Big[ (q + p)^{n-1} + (n-1)p\Big\{ q^{n-2} + (n-2)q^{n-3}p + \frac{(n-2)(n-3)}{1 \cdot 2} q^{n-4}p^2 + \dots + p^{n-2} \Big\} \Big].$$

The second group of terms, thus simplified, is seen to be
(n - 1)p multiplied by the expansion of the binomial
$(q + p)^{n-2}$. Thus we have, as the sum of the items in
col. (4) of the preceding table,

$$np[(q + p)^{n-1} + (n - 1)p(q + p)^{n-2}].$$

But since q + p = 1, $(q + p)^{n-1} = 1$ and $(q + p)^{n-2} = 1$.
Accordingly, the total of col. (4) becomes

$$np[1 + p(n - 1)].$$

As a general formula for the standard deviation, in
squared form, we have

$$\sigma^2 = \frac{\Sigma(fd'^2)}{N} - c^2$$

where d' is the deviation of an item from an arbitrary origin and
c is the difference between the mean of the distribution
and the arbitrary origin. In the present instance,
the origin is at 0, or “no successes,” and c is equal to the
mean, or np. N is equal to Σ(f), or 1, in this case. Thus
the standard deviation of the binomial distribution is given by

$$\begin{aligned} \sigma^2 &= np[1 + p(n - 1)] - n^2p^2 \\ &= np[np + (1 - p)] - n^2p^2 \\ &= n^2p^2 + np(1 - p) - n^2p^2 \\ &= np(1 - p) \\ &= npq \end{aligned}$$

or $\sigma = \sqrt{npq}$.
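The two results, M = np and σ² = npq, can be checked numerically by building the frequency columns term by term. The sketch below is only an illustration: it assumes that col. (3) holds each frequency multiplied by its number of successes and col. (4) holds each frequency multiplied by the square of that number, and the function name and the trial values n = 10, p = 0.3 are hypothetical.

```python
from math import comb, sqrt


def binomial_moments(n, p):
    """Sum the frequency-table columns for (q + p)^n with the origin at 0 successes."""
    q = 1.0 - p
    # col. (2): terms of the expansion, for k = 0, 1, ..., n successes
    freqs = [comb(n, k) * q ** (n - k) * p ** k for k in range(n + 1)]
    total = sum(freqs)                                       # should equal 1
    sum_fx = sum(k * f for k, f in enumerate(freqs))         # assumed col. (3)
    sum_fx2 = sum(k * k * f for k, f in enumerate(freqs))    # assumed col. (4)
    mean = sum_fx / total
    variance = sum_fx2 / total - mean ** 2                   # c equals the mean here
    return total, mean, variance


total, mean, variance = binomial_moments(10, 0.3)
print(total, mean, variance, sqrt(variance))
```

With these trial values the direct summation should agree with np = 3 and npq = 2.1, apart from floating-point rounding.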
DERIVATION OF THE STANDARD ERROR
OF THE ARITHMETIC MEAN

We have made n random, hence independent, observations
on a given variable. The respective observations may be
represented by $X_1, X_2, X_3, \dots, X_n$. Representing the sum of
the n observations by W, we have

$$W = X_1 + X_2 + X_3 + \dots + X_n. \qquad (1)$$

Additional samples are now taken until we have N values of
$X_1$, N values of $X_2$, etc., and hence N values of the sum, W.
We have N samples, therefore, of n observations each. The
mean values, which we may represent by barred letters,
stand in the same relationship of equality:

$$\bar{W} = \bar{X}_1 + \bar{X}_2 + \bar{X}_3 + \dots + \bar{X}_n. \qquad (2)$$

Using small letters ($w, x_1, x_2$, etc.) to define deviations of
the actual observations from these mean values, we may
write, for any given sample, or series of observations,

$$w = x_1 + x_2 + x_3 + \dots + x_n. \qquad (3)$$

Squaring the two sides of this equation, we have

$$\begin{aligned} w^2 = {} & x_1^2 + x_2^2 + x_3^2 + \dots + x_n^2 + 2x_1x_2 + 2x_1x_3 + \dots \\ & + 2x_1x_n + 2x_2x_3 + \dots + 2x_2x_n + \dots \\ & + 2x_3x_n + \dots \end{aligned} \qquad (4)$$

Each term on the right-hand side of (3) will appear in squared
form in (4), and there will also appear product terms of the
form $2x_1x_2$ corresponding to all possible pairings of the terms
on the right-hand side.

The next step involves the summation of the equations
of type (4), derived from the N samples, and division
throughout by N. Each product term, when thus summed
and divided by N, will be of the form

$$\frac{2\Sigma x_1x_2}{N}.$$

This, with the modification introduced by the factor 2,
resembles the familiar mean product, encountered in
correlation procedure. This mean product, we have seen, has a
value of zero when the variables x and y are uncorrelated.
But, by hypothesis, the observations that have given us
$x_1, x_2, x_3$, etc., are independent of one another, and hence
these variables are uncorrelated. Accordingly, each of the
product terms, derived when N equations corresponding
to (4) above are summed and divided by N, is equal to zero.
The process of summation and division gives us, therefore,

$$\frac{\Sigma w^2}{N} = \frac{\Sigma x_1^2}{N} + \frac{\Sigma x_2^2}{N} + \frac{\Sigma x_3^2}{N} + \dots + \frac{\Sigma x_n^2}{N} \qquad (5)$$

or

$$\sigma_w^2 = \sigma_1^2 + \sigma_2^2 + \sigma_3^2 + \dots + \sigma_n^2. \qquad (6)$$

If all the observations relate to the same universe (i.e., if
the samples are all drawn from the same parent population),
which is true, by hypothesis, the standard deviations appearing
in the right-hand member of equation (6) are equal to
one another and to the standard deviation of the population.
Accordingly, using σ to represent that standard deviation,
we have

$$\sigma_w^2 = n\sigma^2. \qquad (7)$$

The next argument, that leads directly to the desired
measurement, follows precisely these steps, which have been
given in the above form to indicate the reasoning involved.
It starts, however, with a variant form of equation (3).
Dividing that equation throughout by n, we have

$$\frac{w}{n} = \frac{x_1}{n} + \frac{x_2}{n} + \frac{x_3}{n} + \dots + \frac{x_n}{n}. \qquad (8)$$

Working with the variables $\frac{w}{n}, \frac{x_1}{n}, \frac{x_2}{n}$, etc., just as we have
done with $w, x_1, x_2$, etc., we may go through the operations
represented by equations (4), (5), and (6), above. The
product terms disappear, as in passing from (4) to (5). In
the process of squaring, $\frac{w}{n}$ is squared as a whole;
the sum of the squared values is thus $\Sigma\left(\frac{w}{n}\right)^2$. Numerator and
denominator of each of the terms of type $\frac{x_1}{n}$ are squared
separately, however, and the sum is of the form $\frac{\Sigma x_1^2}{n^2}$. Division
throughout by N then gives the quantities appearing in
equation (9), which corresponds to equation (6).

$$\sigma_{w/n}^2 = \frac{\sigma_1^2}{n^2} + \frac{\sigma_2^2}{n^2} + \frac{\sigma_3^2}{n^2} + \dots + \frac{\sigma_n^2}{n^2}. \qquad (9)$$

Since all observations relate to the same universe, this
reduces to

$$\sigma_{w/n}^2 = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}. \qquad (10)$$

From this

$$\sigma_{w/n} = \frac{\sigma}{\sqrt{n}}. \qquad (11)$$

But w is the sum of n observations drawn from a universe
having a standard deviation of σ, and $\frac{w}{n}$ is the mean of these
observations. $\sigma_{w/n}$ is the standard deviation of a distribution
of arithmetic means, corresponding to the familiar symbol
$\sigma_M$. This is the desired expression for the standard error
of the arithmetic mean, appropriate for use when the σ
of the population is known. Where σ is estimated from the
standard deviation of a sample, accuracy is increased by
using $\sqrt{n - 1}$ rather than $\sqrt{n}$ in the denominator of the
right-hand member of (11).
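Relation (11) can be verified exactly on a small scale, without random sampling, by listing every possible sample. The sketch below is a hedged illustration under stated assumptions: the five-item population is hypothetical, samples are drawn with replacement (so that the observations are independent), and every ordered sample is treated as equally likely.

```python
from itertools import product
from math import sqrt


def population_sd(values):
    """Standard deviation about the mean, dividing by the number of items."""
    m = sum(values) / len(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def sd_of_sample_means(population, n):
    """Standard deviation of the means of all ordered samples of size n."""
    means = [sum(sample) / n for sample in product(population, repeat=n)]
    return population_sd(means)


population = [2.0, 3.0, 5.0, 8.0, 13.0]   # hypothetical universe
n = 4
sigma = population_sd(population)
print(sd_of_sample_means(population, n), sigma / sqrt(n))
```

The two printed figures should agree apart from floating-point rounding, since the enumeration covers the full distribution of arithmetic means for this universe.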
ILLUSTRATING THE MEASUREMENT OF TREND
BY A MODIFIED EXPONENTIAL CURVE, A GOMPERTZ
CURVE AND A LOGISTIC CURVE

The discussion in Chapter VII of mathematical functions 
suitable for use in measuring the secular trends of time series 
dealt with types required in ordinary practice. We here 
discuss briefly three other types suited to the measurement 
of long-term movements in economic and business series. 

The Modified Exponential Curve 
An exponential curve, which plots as a straight line on 
ratio paper, is a suitable measure of trend for a series that 
is increasing or decreasing at a constant rate, that is, one 
that shows constancy of relative growth. The figures defin- 
ing the successive trend values of a series of this type con- 
stitute a geometric progression. The trends of certain eco- 
nomic series that depart from constancy of relative growth 
may be accurately defined by a simple modification of the 
exponential curve. This is the case when the observed
values may be transformed, by the addition (or subtraction)
of a constant magnitude, to a series closely approximating
such a geometric progression.

If we represent by K the constant magnitude that is to
be added (algebraically) to each observed value in effecting
the desired transformation, the task of fitting the trend line
involves the following steps:

Determination of K.

Correction of observed values by K, to obtain the modified series.

Fitting an exponential curve to the modified series, and computation
of trend values of the modified series.

Correction of trend values of the modified series by K to obtain
trend values of original series.

If y represents the ordinates of trend of the original series
and x represents time, the equation to the desired line of
trend may be put in the form

$$y = ab^x - K$$

where K is the correction factor noted above and a and b
are constants to be determined by fitting an exponential
curve to the modified series. The procedure may be
illustrated with reference to the data in Table D.

Table D

Illustrating the Fitting of a Modified Exponential Curve
Production of Rayon Filament Yarn in the United States, 1920-1931 ¹
(Data in thousands of pounds)

 (1)       (2)             (3)              (4)           (5)             (6)
 Year    Original         Group          Modified     Trend values,   Trend values,
          series           mean           series        modified        original
        (observed)                       (2) + K         series          series
                                                                        (5) - K

 1920     10,125                          27,669         29,108          11,564
 1921     14,986     M1 =  21,034.25      32,530         34,363          16,819
 1922     24,067                          41,611         40,565          23,021
 1923     34,959                          52,503         47,888          30,344

 1924     36,328                          53,872         56,532          38,988
 1925     51,049     M2 =  56,406.25      68,593         66,736          49,192
 1926     62,693                          80,237         78,782          61,238
 1927     75,555                          93,099         93,003          75,459

 1928     97,232                         114,776        109,790          92,246
 1929    121,399     M3 = 124,210.75     138,943        129,608         112,064
 1930    127,333                         144,877        153,003         135,459
 1931    150,879                         168,423        180,621         163,077


¹ Data from Textile Economics Bureau.

In employing this method we approximate K empirically
by breaking the observed series into three parts, representing
equal periods of time, and determining the mean of the
observations for each period. We may designate these means,
in chronological order, by $M_1$, $M_2$, and $M_3$. The desired
value, K, is given by

$$K = \frac{M_2^2 - (M_1 \times M_3)}{(M_1 + M_3) - 2M_2}.$$

If the observed series constitute a geometric progression 
the value of K will be zero; if the addition of a constant 
magnitude to the members of the original series will yield 
a series approximating a geometric progression, K will be 
positive; if the subtraction of a constant amount from the 
observed values will yield a series approximating a geometric
progression, K will be negative. (In practice, K is given 
the sign obtained by the employment of the method de- 
scribed above, and then added algebraically to the observed 
series.) 

In the present case we have

$$K = \frac{(56{,}406.25)^2 - (21{,}034.25 \times 124{,}210.75)}{(21{,}034.25 + 124{,}210.75) - (2 \times 56{,}406.25)} = +17{,}544.$$
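The determination of K from the three group means may be sketched in a few lines. This is an illustration only: the function name is hypothetical, the series is the observed column of Table D, and the routine assumes that the number of observations is a multiple of three so that the groups do not overlap.

```python
def modification_constant(series):
    """Approximate K from the means of three equal, consecutive groups."""
    if len(series) % 3 != 0:
        raise ValueError("series length must be a multiple of three")
    size = len(series) // 3
    m1, m2, m3 = (sum(series[i * size:(i + 1) * size]) / size for i in range(3))
    # K = [M2^2 - (M1 x M3)] / [(M1 + M3) - 2 M2]
    k = (m2 ** 2 - m1 * m3) / ((m1 + m3) - 2 * m2)
    return (m1, m2, m3), k


# Rayon filament yarn production, 1920-1931, thousands of pounds (Table D, col. 2)
rayon_1920_1931 = [10125, 14986, 24067, 34959, 36328, 51049,
                   62693, 75555, 97232, 121399, 127333, 150879]

means, k = modification_constant(rayon_1920_1931)
modified = [y + round(k) for y in rayon_1920_1931]   # col. (4) of Table D
print(means, round(k), modified[0], modified[-1])
```

A positive K, as here, indicates that a constant must be added to the observed values to approximate a geometric progression.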

Adding this amount to each of the values recorded in 
col. (2) of Table D, we obtain the modified series in col. (4). 
In fitting an exponential curve to the modified series, it is
desirable to use logarithms, that is, to solve for the constants
in an equation of the type log y = log a + (log b)x. This
procedure was explained in Chapter VII. For log a of this
curve we obtain 4.824359, and for log b, .072068. (The origin
is at 1925.) The antilogarithms of the series of trend values
thus obtained are given in col. (5). These define the trend of 
the modified series. Subtracting K (algebraically) from these 
values we obtain the trend values of the original series, which 
appear in col. (6). 
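The fitting of the exponential curve to the modified series can be sketched as a straight-line least squares fit to the common logarithms. The code below is an assumption-laden illustration rather than the procedure of Chapter VII itself: it takes K = 17,544 and the modified series from Table D as given, places the origin at 1925, and uses hypothetical function and variable names.

```python
from math import log10

K = 17544                      # correction constant found above
ORIGIN = 1925                  # origin of the time scale
years = list(range(1920, 1932))
# Modified series, col. (4) of Table D (observed values plus K)
modified = [27669, 32530, 41611, 52503, 53872, 68593,
            80237, 93099, 114776, 138943, 144877, 168423]


def fit_log_line(xs, ys):
    """Least squares fit of log10(y) = log_a + log_b * x."""
    logs = [log10(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    log_b = sxy / sxx
    log_a = mean_log - log_b * mean_x
    return log_a, log_b


xs = [year - ORIGIN for year in years]
log_a, log_b = fit_log_line(xs, modified)
trend_modified = [10 ** (log_a + log_b * x) for x in xs]   # col. (5)
trend_original = [t - K for t in trend_modified]           # col. (6)
print(round(log_a, 6), round(log_b, 6))
print([round(t) for t in trend_original])
```

The constants obtained in this way should lie close to the values of log a and log b quoted in the text; small differences in the last decimal places may arise from rounding.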

The original series measuring production of rayon filament
yarn and the modified exponential curve fitted to this
series are shown graphically in Fig. A.

Fig. A. Total Production of Rayon Filament Yarn in the United States,
1920-1931, with Modified Exponential Trend

It is essential that the three M's used in the determination
of K relate to equal numbers of observations and that
the midpoints, in time, of the three periods be equidistant.
In the above example the number of years included in the
period is a multiple of three, and no difficulty arises. If the
number of years included is not a multiple of three, intervals
that overlap slightly may be employed. For example, if our
series had run from 1920 to 1932, the three averages might
have been derived, respectively, from the five-year periods
1920-1924, 1924-1928, 1928-1932. These would center, re- 
spectively, at 1922, 1926, and 1930, and would thus be equi- 
distant in time from one another. Alternatively, if monthly 
data are available, division of the total period into three 
equal parts may be facilitated by using a time-unit of 4 or 
8 months, rather than 12 months. 

The Gompertz Curve 

The Gompertz curve, which has important uses in actu- 
arial science, has had some application in the study of eco- 
nomic and business trends. The term “growth curve” is 
applicable to it, since it portrays a process of cumulative 
expansion to a maximum value. This expansion proceeds 
by decreasing absolute increments in the later stages, but 
continues to the end without retrogression. It may not be 
assumed that this form of growth is typical of all industrial
development, but the curve has value as an empirical rep- 
resentation of certain trend movements. 

For the purpose of fitting, the equation to the curve is
transformed from the natural form

$$y = ab^{c^x}$$

to the logarithmic form

$$\log y = \log a + (\log b)c^x.$$

When fitted to an appropriate set of observations, measur- 
ing the expansion of an industry or the growth of an eco- 
nomic element, log a is the logarithm of the maximum value 
— the ceiling that the curve approaches. The second term 
measures the amount by which the trend value at a given 
time falls short of this maximum, an amount that diminishes, 
of course, with the passage of time. (The series for which 
this curve is an appropriate measure of trend will be ex- 
panding by decreasing amounts in the later stages of the 
period covered, and c, derived in the manner indicated below, 
will have a value between zero and unity.) The origin on
the x-scale (time) is taken at the year to which the first
entry relates.

The method employed in fitting this curve is an approxi- 
mative one, since the least squares procedure in customary 
form is not applicable. Here, as in the preceding example,
the series is broken into three equal portions. The sum of
the logarithms of the observations in each of these segments
is obtained; from these sums, and the differences between
them, the necessary constants may be computed. The
method is illustrated with reference to the data of rayon
production for the years 1920-1937, which appear in Table E.

Table E

Computation of Quantities Required in the Fitting of a
Gompertz Curve

Production of Rayon Filament Yarn in the United States, 1920-1937
(Data in thousands of pounds)

 Year       Rayon            Log y         Sub-totals     First differences
          production
              y

 1920        10,125        4.0053950
 1921        14,986        4.1756857
 1922        24,067        4.3814220
 1923        34,959        4.5435590
 1924        36,328        4.5602415
 1925        51,049        4.7079872       26.374290
                                                          d1 = S2 - S1 = 3.656786
 1926        62,693        4.7972191
 1927        75,555        4.8782632
 1928        97,232        4.9878092
 1929       121,399        5.0842151
 1930       127,333        5.1049409
 1931       150,879        5.1786288       30.031076
                                                          d2 = S3 - S2 = 2.095138
 1932       134,670        5.1292709
 1933       213,498        5.3293938
 1934       208,321        5.3187331
 1935       257,557        5.4108734
 1936       277,626        5.4434601
 1937       312,236        5.4944829       32.126214

 Total                    88.5315830

We may use n to define the number of terms entering
into each of the three sub-totals (in the present example
n = 6); the sub-totals are represented, in chronological order,
by $S_1$, $S_2$, and $S_3$; the first differences between the sub-totals
are represented by $d_1$ and $d_2$.¹ We use these quantities
in solving for the three constants c, log b and log a. The
general relations from which these values are determined
are the following:

$$c^n = \frac{d_2}{d_1}$$

$$\log b = \frac{d_1(c - 1)}{(c^n - 1)^2}$$

$$\log a = \frac{1}{n}\left( S_1 - \frac{d_1}{c^n - 1} \right)$$

Inserting the proper quantities, we have

$$c^6 = \frac{2.095138}{3.656786} = .572945$$

$$c = \sqrt[6]{.572945} = .911351$$

$$\log b = \frac{3.656786 \times (-.088649)}{(.572945 - 1)^2} = -1.777493$$

$$\log a = \frac{1}{6}\left( 26.374290 - \frac{3.656786}{.572945 - 1} \right) = 5.822848.$$

The required equation is, therefore,

$$\log y = 5.822848 - 1.777493(.911351^x)$$

in which x relates to deviations from an origin at the position
of the first term.
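The three-segment method for the Gompertz constants may be sketched as follows. This is an illustrative reading of the relations given above, not a definitive routine: the function names are hypothetical, common logarithms are computed directly rather than taken from a table, and the sketch assumes the series length is a multiple of three and that the ratio of the first differences is positive.

```python
from math import log10


def fit_gompertz(series):
    """Fit log y = log_a + log_b * c**x from three equal segments of log y."""
    if len(series) % 3 != 0:
        raise ValueError("series length must be a multiple of three")
    n = len(series) // 3
    logs = [log10(y) for y in series]
    s1, s2, s3 = (sum(logs[i * n:(i + 1) * n]) for i in range(3))
    d1, d2 = s2 - s1, s3 - s2
    c_n = d2 / d1
    if c_n <= 0:
        raise ValueError("ratio of first differences must be positive")
    c = c_n ** (1.0 / n)
    log_b = d1 * (c - 1) / (c_n - 1) ** 2
    log_a = (s1 - d1 / (c_n - 1)) / n
    return c, log_b, log_a


def gompertz_trend(x, c, log_b, log_a):
    """Trend ordinate at x time units from the first observation."""
    return 10 ** (log_a + log_b * c ** x)


# Rayon filament yarn production, 1920-1937, thousands of pounds (Table E)
rayon_1920_1937 = [10125, 14986, 24067, 34959, 36328, 51049,
                   62693, 75555, 97232, 121399, 127333, 150879,
                   134670, 213498, 208321, 257557, 277626, 312236]

c, log_b, log_a = fit_gompertz(rayon_1920_1937)
print(round(c, 6), round(log_b, 6), round(log_a, 6))
print(round(gompertz_trend(0, c, log_b, log_a)), round(10 ** log_a))
```

The last figure printed is the antilogarithm of log a, that is, the ceiling toward which the fitted curve tends.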

Substituting in this trend equation the values of x given 
in Table F, logarithms of the trend values are obtained. The 
corresponding natural numbers define the course of the line 
of trend. The method of calculation is indicated in Table F. 

¹ The condition, previously noted, that the series to which the curve is to
be fitted be one that is expanding by decreasing increments in the later
stages of the period covered, is met when $d_2$ is less than $d_1$.

Table F

Illustrating the Computation of Ordinates of Trend of a Gompertz
Curve Fitted to Data of Rayon Production, 1920-1937

 (1)     (2)       (3)           (4)             (5)               (6)
 Year     x        c^x       (log b)c^x         log y               y
                                            (4) + log a      Anti-log of (5)
                                                             (in thousands
                                                              of pounds)

 1920     0     1.000000     - 1.777493       4.045355            11,101
 1921     1     0.911351     - 1.619920       4.202928            15,956
 1922     2     0.830560     - 1.476315       4.346533            22,209
 1923     3     0.756932     - 1.345441       4.477407            30,020
 1924     4     0.689830     - 1.226168       4.596680            39,508
 1925     5     0.628677     - 1.117469       4.705379            50,743
 1926     6     0.572945     - 1.018408       4.804440            63,744
 1927     7     0.522154     - 0.928125       4.894723            78,474
 1928     8     0.475865     - 0.845847       4.977001            94,842
 1929     9     0.433681     - 0.770865       5.051983           112,715
 1930    10     0.395235     - 0.702527       5.120321           131,923
 1931    11     0.360198     - 0.640249       5.182599           152,265
 1932    12     0.328267     - 0.583492       5.239356           173,523
 1933    13     0.299166     - 0.531765       5.291083           195,471
 1934    14     0.272645     - 0.484625       5.338223           217,883
 1935    15     0.248475     - 0.441663       5.381185           240,539
 1936    16     0.226448     - 0.402510       5.420338           263,231
 1937    17     0.206374     - 0.366830       5.456018           285,771

 1947    27     0.081566     - 0.144983       5.677865           476,283
 1957    37     0.032238     - 0.057303       5.765545           582,834
 1967    47     0.012741     - 0.022647       5.800201           631,245


The original data and the Gompertz curve fitted to them 
are shown graphically in Fig. B. 

The ceiling to this curve is set by the constant a, which 
has a value of approximately 665,000,000 pounds. This 
indicates that if the extrapolation of the trend of rayon 
production from 1920 to 1937, as measured by a Gompertz 
curve, accurately defines the future course of production, 
the maximum output to be expected is 665 million pounds 
per year. (It need hardly be pointed out that this extra- 
polation involves some doubtful assumptions, and that no 
mystic significance is to be attached to it.) The years to
which the present data relate were years of rapid expansion
in the industry. The slackening of the rate of increase,
which is to be expected in a mature industry, had not become
marked by 1937. In order that the nature of the curve
may be clear, extrapolated values for 1947, 1957, and 1967
are given in the table, and the projection of the trend is
shown in Fig. B. After 1947, and still more conspicuously
after 1957, the curve shows a notable dampening in the rate
of expansion. We may not say that the industry will actually
follow this course. In particular, the asymptote a may be
expected to change, as conditions affecting the industry
and the demand for its products vary in the future. Within
the limits of the observations, however, the Gompertz curve
serves as a satisfactory measure of trend.

Fig. B. Total Production of Rayon Filament Yarn in the United States,
1920-1937, with Gompertz Trend Line Extrapolated to 1967

The Logistic Curve 

The logistic curve, sometimes termed the Pearl-Reed 
growth curve because of the extensive use made of it in 
population studies by Raymond Pearl and L. J. Reed, 
resembles somewhat the Gompertz curve discussed above.
It represents a modified geometric progression, describing
a series whose growth tends to decrease as it approaches some
specified limit. Like the Gompertz curve it may be used as
an empirical approximation to the trends of certain economic
series. Extrapolations are subject, of course, to the same
uncertainties that attach to projections of other empirically
derived trend lines.

Table G

Computation of Quantities Required in the Fitting of a Logistic Curve
to Data of Railroad Mileage Operated in the United States,
by Five-Year Intervals, 1850-1935

 (1)        (2)              (3)             (4)               (5)
 Year     Miles of       100,000,000      Sub-totals          First
          railroad       divided by y                      differences
          operated
             y

 1850       9,021          11,085
 1855      18,374           5,442
 1860      30,626           3,265
 1865      35,086           2,850
 1870      52,922           1,890
 1875      74,096           1,350        S1 = 25,882
                                                          d1 = S2 - S1 = - 21,849
 1880      93,262           1,072
 1885     128,320             779
 1890     156,404             639
 1895     177,746             563
 1900     192,556             519
 1905     216,974             461        S2 =  4,033
                                                          d2 = S3 - S2 = - 1,679
 1910     240,831             415
 1915     257,569             388
 1920     259,941             385
 1925     258,631             387
 1930     260,440             384
 1935     252,930             395        S3 =  2,354

A form of this curve adapted to use as a measure of trend
is defined by the equation

$$\frac{1}{y} = a + bc^x.$$

This, it will be noted, is the equation to a modified exponential
curve, except that the dependent variable is $\frac{1}{y}$ rather
than y. (The symbols here used for the constants differ
somewhat from those employed in treating the modified 
exponential curve.) A method of fitting somewhat similar 
to those employed in the preceding examples may be em- 
ployed, with necessary modifications required by the use of 
reciprocals of y. The method may be discussed with refer- 
ence to the data of railroad mileage in Table G. Compu- 
tations are facilitated by multiplying the reciprocals of y by 
a suitable power of 10, as is done in col. (3) of this table. 

As in the two preceding illustrations, the observations are 
divided, chronologically, into three equal groups. Group 
sub-totals and the first differences between these sub-totals 
are computed. The symbol n is used for the number of terms
in each of these sub-groups. The origin of the x-scale (time)
is set at the date of the first observation. The time unit 
here employed is five years. 

The constants in the desired equation may be derived
from the following relations.

$$c^n = \frac{d_2}{d_1}$$

$$b = \frac{d_1(c - 1)}{(c^n - 1)^2}$$

$$a = \frac{1}{n}\left\{ S_1 - \frac{d_1}{c^n - 1} \right\}$$

Substituting the given values, we have

$$c = \sqrt[6]{.076846} = .652034$$

$$b = \frac{-21{,}849(-.347966)}{(.076846 - 1)^2} = +8{,}921.14$$

$$a = \frac{1}{6}\left\{ 25{,}882 - \frac{-21{,}849}{.076846 - 1} \right\} = +369.04.$$

These results relate to initial observations which have been
modified by the multiplication of $\frac{1}{y}$ by 100,000,000. The
desired equation is, therefore,

$$\frac{100{,}000{,}000}{y} = 369.04 + 8{,}921.14(.652034^x)$$

where x measures deviations in five-year units from an origin
at 1850.
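The same three-segment procedure, applied to scaled reciprocals, may be sketched as below. The sketch is illustrative only: the function name is hypothetical, the reciprocals are rounded to whole numbers on the assumption that this is how col. (3) of Table G was prepared, and the mileage figures are those of col. (2) of that table.

```python
def fit_logistic(series, scale=100_000_000):
    """Fit scale / y = a + b * c**x from three equal segments of reciprocals."""
    if len(series) % 3 != 0:
        raise ValueError("series length must be a multiple of three")
    n = len(series) // 3
    # Scaled reciprocals, rounded to whole numbers (assumed basis of col. 3)
    recips = [round(scale / y) for y in series]
    s1, s2, s3 = (sum(recips[i * n:(i + 1) * n]) for i in range(3))
    d1, d2 = s2 - s1, s3 - s2
    c_n = d2 / d1
    if c_n <= 0:
        raise ValueError("ratio of first differences must be positive")
    c = c_n ** (1.0 / n)
    b = d1 * (c - 1) / (c_n - 1) ** 2
    a = (s1 - d1 / (c_n - 1)) / n
    return (s1, s2, s3), c, b, a


# Miles of railroad operated, by five-year intervals, 1850-1935 (Table G)
mileage = [9021, 18374, 30626, 35086, 52922, 74096,
           93262, 128320, 156404, 177746, 192556, 216974,
           240831, 257569, 259941, 258631, 260440, 252930]

sub_totals, c, b, a = fit_logistic(mileage)
upper_limit = 100_000_000 / a          # upper asymptote of y, in miles
trend_1850 = 100_000_000 / (a + b)     # trend ordinate at x = 0
print(sub_totals, round(c, 6), round(b, 2), round(a, 2))
print(round(upper_limit), round(trend_1850))
```

Because c lies between zero and unity, the term b times c to the power x dwindles with time and the trend ordinate approaches the upper limit computed on the next-to-last line.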

Succeeding calculations are shown in Table H. 


Table H

Computation of Ordinates of Trend of Logistic Curve Fitted to Data
of Railroad Mileage

 (1)     (2)       (3)          (4)          (5)               (6)
 Year     x        c^x         bc^x      100,000,000            y
                                         divided by y     100,000,000
                                         (a + bc^x)       divided by (5)

 1850     0     1.000000      8,921         9,290             10,764
 1855     1      .652034      5,817         6,186             16,166
 1860     2      .425148      3,793         4,162             24,027
 1865     3      .277211      2,473         2,842             35,186
 1870     4      .180751      1,613         1,982             50,454
 1875     5      .117856      1,051         1,420             70,423
 1880     6      .076846        686         1,055             94,787
 1885     7      .050106        447           816            122,549
 1890     8      .032671        291           660            151,515
 1895     9      .021303        190           559            178,891
 1900    10      .013890        124           493            202,840
 1905    11      .009057         81           450            222,222
 1910    12      .005905         53           422            236,967
 1915    13      .003850         34           403            248,139
 1920    14      .002511         22           391            255,754
 1925    15      .001637         15           384            260,417
 1930    16      .001067         10           379            263,852
 1935    17      .000696          6           375            266,667

The process of calculation is a straightforward one. The
reciprocals of the entries in col. (5), multiplied by 100,000,000,
yield the desired trend values given in col. (6). These
values, with the original series, are shown graphically in
Fig. C.

Fig. C. Railroad Mileage Operated in the United States, by Five-Year
Intervals, 1850-1935, with Logistic Trend

As in the case of the Gompertz curve, the logistic is suitable
for measuring the trend of a series that, in its later
stages, is growing by decreasing increments. The curve
resembles an elongated S rising from a lower asymptote of
zero to an upper limit indicated by the constant a. Since a in
this case refers to an equation in which the dependent variable
is $\frac{100{,}000{,}000}{y}$, the upper asymptote is $\frac{100{,}000{,}000}{a}$.
From the given value of a, 369.04, we derive 270,973 miles
as the upper limit of railroad mileage in the United States.
As is clear from the table and chart, the actual values are
close to this indicated limit. Barring the possibility of a
fundamental change in relevant conditions, the record and
the curve fitted to it indicate that the era of railroad expansion
has ended. The extrapolation is, of course, subject
to all the reservations that attach to the projection of other
curves. There can be no doubt that, within the limits of
the observations, the logistic curve gives an excellent
representation of the actual history of railroad operation in the
United States.

A FURTHER APPLICATION OF VARIANCE
ANALYSIS

The possibilities of Fisher’s method of variance analysis 
were far from exhausted by the several examples given in 
Chapter XV. We here supplement the treatment in that 
chapter by an additional example, illustrating further tests 
that may be made with a two-fold principle of classification. 

The observations on which this example is based consist
of relative numbers, measuring the prices of 670 commodities
in February, 1933, with average prices in 1926 taken as 100.
February, 1933 marked the low point of the severe price
decline that began in 1929. The questions to which our
tests are directed relate to the relative severity of the declines
occurring among different classes of goods.

The 670 price relatives (obtained from price quotations
compiled by the U. S. Bureau of Labor Statistics) may be
classified into those relating to perishable goods (505 in
number) and those relating to durable goods (165 in number).
The classification has economic significance because of differences
in the market conditions, on both supply and demand
sides, affecting these classes of goods during a major recession.
Again, the 670 observations may be broken down into
those relating to raw materials (134 in number) and those
relating to manufactured goods (536 in number). Applying
the two principles of classification jointly we obtain 4 sub-groups,
perishable raw materials (101 in number), perishable
manufactured goods (404 in number), durable raw
materials (33 in number) and durable manufactured goods
(132 in number). It is to be noted that the ratio of the number
of perishable raw materials to the number of perishable
manufactured goods, 101:404, is the same as the ratio
of the number of durable raw materials to the number of
durable manufactured goods, 33:132. It is a necessary condition
of the procedure here discussed that the frequencies
in the several sub-groups be proportional.

Various questions relating to the significance of these prin- 
ciples of classification may be answered with reference to the 
summary figures given in Table I. 


Table I

Measurements Relating to the Analysis of the Relative Prices
of 670 Commodities for February, 1933
(1926 = 100)

      Group                              Number        Mean                Sum of squares
                                                                           of deviations

 1    Perishable raw materials           N1 = 101      M1 = 41.663366       31,118.56
 2    Perishable manufactured goods      N2 = 404      M2 = 62.329208      187,414.21
 I    All perishable goods               Np = 505      Mp = 58.196040      253,040.57

 3    Durable raw materials              N3 = 33       M3 = 65.060606       12,217.88
 4    Durable manufactured goods         N4 = 132      M4 = 75.719697       31,308.63
 II   All durable goods                  Nd = 165      Md = 73.587879       46,525.97

 A    All raw materials                  Nr = 134      Mr = 47.425373       56,952.76
 B    All manufactured goods             Nm = 536      Mm = 65.626866      236,562.35
      All commodities                    N  = 670      M  = 61.986567      329,029.89


The entries relating to each group and sub-group define
the number of commodities included, the mean value of the
price relatives for February, 1933, and the sum of the
squares of the deviations of the observations in that group
from the mean of that group. Thus for perishable raw materials
the mean is 41.663366 (indicating an average price
decline of 58.34 per cent) and the sum of the squares of the
deviations of the 101 observations in this group from
41.663366 is 31,118.56. For all commodities the mean is
61.986567, and the sum of the squares of the deviations of
the individual items from this mean is 329,029.89.

A TEST OF THE PERISHABLE-DURABLE PRINCIPLE OF
CLASSIFICATION

We may first ask whether the application of the two 
basic principles of classification, considered separately, gives 
groups showing significant differences in their price changes 
from 1926 to February, 1933. Examining the results of the 
perishable-durable distinction, we note that durable goods, 
with an average of 73.587879, show smaller price declines 
than perishable goods, for which the average is 58.196040. 
(Six decimal places are retained in the averages because these 
figures enter into later calculations.) Is the difference sig- 
nificant, or may it be attributed to chance? A test of the 
type discussed in Chapter XV provides an answer to this 
question. For the application of the test we must divide the 
total variability, 329,029.89, into a portion unaffected by 
perishable-durable differences and a portion that may be 
attributed to the play of forces directly related to this dis- 
tinction. 

The first of these portions, measuring the variability 
within classes, is derived directly from the figures in Table I. 

Variability within perishable group
    = $\Sigma d^2$ for that group          = 253,040.57
Variability within durable group
    = $\Sigma d^2$ for that group          =  46,525.97
Total variability within classes         = 299,566.54

In deriving a measure of the variability between classes
we take the deviation of each class mean from the mean of
all the observations, square this, and weight by the number
of observations in that class. Thus

$$
\begin{aligned}
\Sigma d^2 \text{ between perishable-durable classes}
  &= [(61.986567 - 58.196040)^2 \times 505] \\
  &\quad + [(61.986567 - 73.587879)^2 \times 165] \\
  &= 29{,}463.31.
\end{aligned}
$$
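
The weighting just described can be sketched in a few lines. This is an illustrative sketch, not the author's procedure; the function name `between_ss` is an assumption, and the inputs are the class sizes and means from Table I.

```python
def between_ss(classes, grand_mean):
    """Sum of squared deviations of class means from the grand mean,
    each weighted by the number of observations in the class."""
    return sum(n * (m - grand_mean) ** 2 for n, m in classes)

grand_mean = 61.986567
# (number of commodities, class mean) for perishable and durable goods
classes = [(505, 58.196040), (165, 73.587879)]

ss_between = between_ss(classes, grand_mean)
print(round(ss_between, 2))
```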

A test of the significance of this classification reduces to
the question whether the variability between classes is
significantly greater than the variability within classes, when
account has been taken of the number of degrees of freedom
present in the two measures of variability. The appropriate
z-test is shown below.

Nature of           Degrees of      Sum of       Variance
variability         freedom n       squares      $\sigma^2$      $\log_e \sigma^2$
Between classes          1         29,463.31    29,463.31      10.290900
Within classes         668        299,566.54       448.45       6.105814
                       669        329,029.85           Diff. =  4.185096
                                                           z =  2.09

For $n_1 = 1$ and $n_2 = 668$ the 1 per cent value of z is
approximately .95; the present value is materially greater than
this. The variance between classes is significantly greater
than the variance within classes. The results are not
consistent with the hypothesis that the true value of z is zero.
There is a significant difference between the February, 1933,
price relatives of perishable and durable goods, on the 1926
base. This principle of classification is a significant one,
with reference to this aspect of price behavior.
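
The value of z in this test is half the difference between the natural logarithms of the two variances. A minimal sketch of that arithmetic follows, using the sums of squares and degrees of freedom from the test above; the function name `z_statistic` is an assumption made for illustration.

```python
import math

def z_statistic(ss_between, df_between, ss_within, df_within):
    """Half the natural log of the ratio of the two variances,
    each variance being a sum of squares over its degrees of freedom."""
    var_between = ss_between / df_between
    var_within = ss_within / df_within
    return 0.5 * math.log(var_between / var_within)

z = z_statistic(29463.31, 1, 299566.54, 668)
print(round(z, 2))
```

The result is then compared with the tabled 1 per cent value of z for the same degrees of freedom (.95 in the present case).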

A TEST OF THE RAW-MANUFACTURED PRINCIPLE OF
CLASSIFICATION

The test of the other main principle of classification follows
exactly the same lines. The total variability, 329,029.89,
is broken into a portion measuring variability within classes
(293,515.11), with 668 degrees of freedom, and a portion
measuring variability between the raw-manufactured classes
(35,514.75) with 1 degree of freedom. The value of z is
2.20; the corresponding 1 per cent value of z is .95. This
principle of classification, also, is significant. Raw and
manufactured goods differed significantly in degree of price
change between 1926 and February, 1933.

A TEST OF THE RESULTS OBTAINED FROM THE JOINT APPLICA- 
TION OF THE PERISHABLE-DURABLE AND RAW-MANU- 
FACTURED PRINCIPLES OF CLASSIFICATION 

The application of the two principles of classification
discussed above yields the 4 cells shown in Table I. We may
ask whether the four groups thus distinguished (perishable
raw materials, perishable manufactured goods, durable raw
materials, and durable manufactured goods) are significantly
different, judged with reference to the present observations.
The two essential elements of the total variability
are derived from the figures in Table I in the manner
indicated below.

Variability within perishable raw materials group   =  31,118.56
Variability within perishable manufactures group    = 187,414.21
Variability within durable raw materials group      =  12,217.88
Variability within durable manufactures group       =  31,308.63
Total variability within cells                      = 262,059.28

This sum furnishes the yardstick that is used in the tests
that follow. It is clear that it represents the action of forces
other than those related to relative durability, or to degree
of fabrication. For its four elements measure variability
among commodities that are alike in respect of durability
and alike in respect of degree of fabrication.* This sum is a
measure of the strength of the forces we lump together
as chance, which here means all factors affecting our
observations other than those related to the relative durability
of commodities or to degree of fabrication of commodities.

* This statement may be accepted as accurate for the purpose of the present
demonstration. Actually, of course, the distinctions between perishable and
durable commodities and between raw and manufactured goods are not
clear-cut and definitive.

A measure of variability between cells is derived as in the
previous examples.

$$
\begin{aligned}
\Sigma d^2 \text{ between cells}
  &= [(61.986567 - 41.663366)^2 \times 101] \\
  &\quad + [(61.986567 - 62.329208)^2 \times 404] \\
  &\quad + [(61.986567 - 65.060606)^2 \times 33] \\
  &\quad + [(61.986567 - 75.719697)^2 \times 132] \\
  &= 66{,}970.60.
\end{aligned}
$$

The test of significance takes the following form. 


Nature of           Degrees of      Sum of       Variance
variability         freedom n       squares      $\sigma^2$      $\log_e \sigma^2$
Between cells            3         66,970.60    22,323.53      10.013395
Within cells           666        262,059.28       393.48       5.975035
                       669        329,029.88           Diff. =  4.038360
                                                           z =  2.02

For $n_1 = 3$, $n_2 = 666$ the 1 per cent value of z is
approximately .67. The present value materially exceeds this.
The conclusion is clear that the joint application of the two
principles of classification yields sub-groups which differed
significantly in their price movements between 1926 and
February, 1933.
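
The whole four-cell test can be reproduced from the cell entries of Table I alone. The following sketch is offered as an illustration under that assumption; variable names are hypothetical, and the grand mean is recomputed from the cells rather than taken from the table.

```python
import math

# (N, mean, sum of squared deviations) for the four cells of Table I
cells = [
    (101, 41.663366, 31118.56),   # perishable raw materials
    (404, 62.329208, 187414.21),  # perishable manufactured goods
    (33, 65.060606, 12217.88),    # durable raw materials
    (132, 75.719697, 31308.63),   # durable manufactured goods
]

total_n = sum(n for n, _, _ in cells)
grand_mean = sum(n * m for n, m, _ in cells) / total_n
ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in cells)
ss_within = sum(ss for _, _, ss in cells)

df_between = len(cells) - 1        # 3
df_within = total_n - len(cells)   # 666
z = 0.5 * math.log((ss_between / df_between) / (ss_within / df_within))
print(round(ss_between, 2), round(ss_within, 2), round(z, 2))
```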

FURTHER TESTS OF THE MAIN PRINCIPLES OF CLASSIFICATION

The test applied in the preceding section does not bring
out the most significant uses of a two-fold principle of
classification. In treating the four cells as we have, we have not
made full use of the information we possess about them. The
variance between cells, measured by the sum 66,970.60, with
3 degrees of freedom, represents the combined influence of
forces related to the perishable-durable principle of
classification, to the raw-manufactured principle, and to the
interaction among forces related to these two principles. We
may apply more refined tests, and obtain more accurate
information about the differential price behavior of
commodities of different types, by distinguishing the components
of the variance between cells. This is done in Table J,
which presents a complete breakdown of the total variance
of the observations with which we are working.

Table J

Price Movements, 1926 to February, 1933
(1926 = 100)

Nature of                          Degrees of     Sum of       Variance
variability                        freedom        squares      $\sigma^2$
Between perishable-durable
    classes                             1        29,463.31    29,463.31
Between raw-manufactured
    classes                             1        35,514.75    35,514.75
Interaction (residual
    variability between cells)          1         1,992.54     1,992.54
Within cells (experimental
    error)                            666       262,059.28       393.48
                                      669       329,029.88



Having these components we may test with greater ac- 
curacy than on pages 683 and 684 the significance of the two 
main principles of classification. For we now have a better 
yardstick, a better measure of the magnitude of variations 
due to the play of “chance.” The variability within cells 
(variance = 393.48) is a better criterion of the magni- 
tude of sampling errors than is the variability within the 
perishable and durable classes (variance = 448.45) or the 
variability within the raw and manufactured classes (vari- 
ance = 439.39). For the variance within the four cells is 
free of the influence of forces connected with either of the 
specified principles of classification. 

This more accurate test of the perishable-durable prin- 
ciple of classification is applied by the customary method. 


Nature of                          Degrees of
variability                        freedom       Variance     $\log_e \sigma^2$
Between perishable-durable
    classes                             1        29,463.31     10.290900
Within cells                          666           393.48      5.975035
                                                      Diff. =  4.315865
                                                          z =  2.16

The 1 per cent value of z is approximately .95, and the
above result is clearly significant. The application of the
perishable-durable principle of classification, under the
conditions represented in Table I, yields classes of commodities
that differed significantly in their price changes between
1926 and February, 1933. It is important to note that raw
and manufactured goods are present in the perishable and
durable groups in precisely the same proportions. One fifth
of the commodities in each group are raw materials and
four fifths are manufactured goods. Thus behavior peculiar
to raw materials may be expected to influence the two
groups in precisely the same degree; the same is true of
behavior peculiar to goods in the manufactured state.¹ It is
necessary, for this reason, that the frequencies in the several
classes be proportional in the application of the tests here
discussed, when two principles of classification are jointly
employed.²

A test of the significance of the raw-manufactured
principle of classification may be applied in the same way.
The variance within cells is employed as yardstick, as in the
preceding example. Here, also, proportionality is necessary,
with raw and manufactured goods being divided in the same
proportions into perishable and durable sub-groups. The
test reveals a significant difference in price behavior between
raw and manufactured commodities.

¹ See below, however, for a test of the significance of the interaction.

² For a discussion of procedures appropriate to cases in which cell frequencies
are not proportional see

Yates, F. Journal of Agricultural Science, Vol. 23, 108 (1933).

Snedecor, G. W. and Cox, G. M. Iowa Agricultural Experiment Station
Bulletin 180 (1935).

A TEST OF THE INTERACTION

Not all the variability between cells is explained by the
two major classifications we have just discussed. The
residual variability between cells, or the interaction, amounts
to 1,992.54, in terms of squared deviations (see Table J).
This may be derived readily by subtracting from the total
variability between cells (66,970.60) the sum of the
variability between perishable-durable classes (29,463.31) and the
variability between raw-manufactured classes (35,514.75).
The number of degrees of freedom in the interaction may be
determined by the same process of subtraction. In the
present instance it is 1.
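
The subtraction that yields the interaction, and its degrees of freedom, can be written out as follows. This is a small illustrative sketch using the sums of squares quoted in the text; the variable names are assumptions.

```python
# Sums of squares quoted in the text, with their degrees of freedom
ss_between_cells = 66970.60       # 3 degrees of freedom
ss_perishable_durable = 29463.31  # 1 degree of freedom
ss_raw_manufactured = 35514.75    # 1 degree of freedom

# The interaction is what remains of the variability between cells
# after the two main classifications have been accounted for.
ss_interaction = ss_between_cells - ss_perishable_durable - ss_raw_manufactured
df_interaction = 3 - 1 - 1

print(round(ss_interaction, 2), df_interaction)
```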

This residual variability may represent “experimental
error,” the play of the same chance forces that are measured
by the variability within cells. The residual variability
was used, in the last example cited in Chapter XV, as a
yardstick defining the magnitude of fluctuations due to chance.
It is proper to assume that this is the case when the two
major principles of classification are quite independent of
one another. But if these principles are correlated, the
residual variability reflects the interaction of the two
principles of classification, that is, the differential behavior of
given classes of goods under the influence of forces related to
the other principle of classification. Thus it may be that the
difference between raw perishable and manufactured
perishable goods is not the same as the difference between raw
durable and manufactured durable goods. The process of
fabrication applied to perishable goods may produce results
(in the form of price behavior) different from those produced
when the process of fabrication is applied to durable goods.
Perishable and durable goods may respond differently, as
regards their price behavior, to the influence of fabrication.
Such differential behavior of categories of goods under the
influence of the same treatment (i.e., fabrication) is
measured by the interaction.

If there is no such differential behavior, in a given
experiment, the residual variability between cells will be of
the same order of magnitude as the variability within cells,
when account is taken of number of degrees of freedom. A
test is applied on page 690.

Nature of                          Degrees of     Sum of       Variance
variability                        freedom        squares      $\sigma^2$      $\log_e \sigma^2$
Interaction (residual
    variability between cells)          1         1,992.54     1,992.54       7.597165
Within cells                          666       262,059.28       393.48       5.975035
                                                                     Diff. =  1.622130
                                                                         z =   .81

If we judge this result with reference to the 1 per cent
value of z (.95), we would conclude that the residual
variability between cells is attributable to the play of chance
rather than to any true interaction. For although the
residual variability is greater than the variance within cells
which we use as yardstick, the excess is not clearly too great
to be attributed to chance. Reference to the 5 per cent
value of z (.675, for $n_1 = 1$, $n_2 = 666$) throws more light
on the situation. Less frequently than 5 times out of 100
would the play of chance alone give us a measure of
residual variability as great as that here obtained. For the z
of .81 is greater than the 5 per cent value, .675. In such a
case as this, where P falls between .01 and .05, the evidence
is not conclusive. There is, however, a strong indication
that perishable and durable goods respond differently, in
their price behavior, to the process of fabrication. Reference
to Table I will show that among both perishable and
durable goods fabrication appears to have reduced susceptibility
to price decline under the force of business recession.
$M_2$ is distinctly greater than $M_1$, and $M_4$ is greater than $M_3$.
But the influence of fabrication was apparently greater
among perishable than among durable goods.¹ Our test
shows that the degree of difference between the two
reductions (i.e., reductions in degree of price decline) is almost
too great to be attributed to chance. The evidence of
differential behavior is strong enough to justify further
investigation.
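
The interaction test can be sketched as below. This is an illustration only: it starts from the interaction sum of squares of 1,992.54 given in Table J and from the two tabled values of z quoted in the text (.675 and .95), and the variable names are assumptions.

```python
import math

ss_interaction, df_interaction = 1992.54, 1   # from Table J
ss_within, df_within = 262059.28, 666         # variability within cells

z = 0.5 * math.log((ss_interaction / df_interaction) / (ss_within / df_within))

# Tabled z values for 1 and 666 degrees of freedom, as quoted in the text
five_per_cent, one_per_cent = 0.675, 0.95

if z > one_per_cent:
    verdict = "significant at the 1 per cent level"
elif z > five_per_cent:
    verdict = "between the 5 and 1 per cent levels"
else:
    verdict = "attributable to chance"

print(round(z, 2), verdict)
```

Computed in this way z comes to about .81, which lies between the two tabled values, so the inconclusive verdict reached in the text stands.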

¹ The statistical evidence does not, of course, yield information as to the
nature of the causal relations involved. The test here applied, if positive,
reveals the presence of interaction, but does not show how the forces involved
interact to bring about the observed differential behavior. The text is to be
read with this qualification in mind.

GLOSSARY OF SYMBOLS

The following are the more important symbols employed
in the preceding pages. Those of which limited use is made,
for special purposes, are not here included. A given symbol
is sometimes called upon to serve different purposes, but the
precise meaning should be clear from the context.

1. General symbols for variables and constants:

x: a variable quantity.
y: a variable quantity.
    In general, any letter near the end of the alphabet may
    be employed to represent a variable quantity. Different
    variable quantities may be represented by the use of a
    single symbol, with different subscripts, as $x_1$, $x_2$, $x_3$,
    or $W_1$, $W_2$, $W_3$. [A distinction is later drawn (cf.
    Symbols employed in the measurement of relationship)
    between capital letters and small letters, as used to
    represent variable quantities.]
a: a constant (i.e., a quantity the value of which does not
    change in the given discussion). In general, any letter
    near the beginning of the alphabet may be used to
    represent a constant.

2. Symbols employed in the analysis and description of
   the frequency distribution:

m: the value of an individual observation; the value of the
    mid-point of a class. (The symbols $m_1$, $m_2$, $m_3$ are
    sometimes employed to represent different observations in
    a series.)
f: the number of observations in a given class; the frequency
    of a given class.
i: the class-interval.
l: the lower limit of a class.
N: the total number of cases in a given series or frequency
    distribution.
d: the deviation of a given observation from an average;
    usually, a deviation from the arithmetic mean. When
    written with a subscript, as $d_x$ or $d_y$, it refers to a
    deviation from the arithmetic mean of the variable
    represented by the subscript. The symbol d is sometimes
    used to designate the difference between mean and
    mode.
d': the deviation of a given observation from an arbitrary
    origin, or assumed mean.
c: the difference between an arbitrary origin, or assumed
    mean, and the true mean (in terms of the symbols
    explained below, $c = M - M'$).
$\Sigma$ (Sigma): the symbol for the process of summation. Thus
    $\Sigma d$ means the sum of all the deviations.
$W_1$, $W_2$, $W_3$: weights attached to a series of measures being
    averaged. (Not to be confused with similar symbols
    used to represent different variable quantities.)
$y_0$: the maximum ordinate of a frequency curve.

Symbols for averages, quartiles, etc.:

M: the arithmetic mean.
Md.: the median.
Mo.: the mode.
$M_g$: the geometric mean.
H: the harmonic mean.
M': the value of an assumed arithmetic mean.
$Q_1$: the first or lower quartile.
$Q_2$: the second quartile or median.
$Q_3$: the third or upper quartile.
K: the value of a point midway between the first and third
    quartiles.
$D_3$: the third decile.

Symbols for measures of variation and skewness:

M.D.: the mean deviation.
$\sigma$: the standard deviation; the root-mean-square deviation
    about the arithmetic mean.
$\sigma'$: the standard deviation of proportions, or relative
    frequencies.
Sa: the root-mean-square deviation about an origin other
    than the arithmetic mean.
P.E.: the probable error.
Q.D.: the quartile deviation.
$q_1$: the difference between the median and the lower quartile
    (Md. $- Q_1$).
$q_2$: the difference between the upper quartile and the median
    ($Q_3 -$ Md.).
V: the coefficient of variation.
sk: a measure of skewness.
$\chi$ (Chi): a measure of skewness based upon the criteria $\beta_1$
    and $\beta_2$.



Symbols for moments and criteria of curve type.

$\nu_1$, $\nu_2$, $\nu_3$, etc.: moments of a frequency distribution about an
    arbitrary origin.
$\pi_1$, $\pi_2$, $\pi_3$, etc.: uncorrected moments of a frequency
    distribution about the arithmetic mean.
$\mu_1$, $\mu_2$, $\mu_3$, etc.: moments of a frequency distribution about the
    arithmetic mean after the application of Sheppard's
    corrections.
$\beta_1$: $\mu_3^2 / \mu_2^3$.
$\beta_2$: $\mu_4 / \mu_2^2$.
$\kappa_2$: a criterion of curve type based on $\beta_1$ and $\beta_2$.


3. Symbols relating to index numbers.

$p_0'$: price of a given commodity at time “0” (the base period).
$q_0'$: quantity of same commodity at time “0”.
$p_1'$: price of same commodity at time “1”.
$q_1'$: quantity of same commodity at time “1”.
$p_0''$: price of a second commodity at time “0”.
$q_0''$: quantity of second commodity at time “0”.
$p_1''$: price of second commodity at time “1”.
$q_1''$: quantity of second commodity at time “1”.
$p_1 / p_0$: a price relative (relation of price of a given commodity
    at time “1” to price of same commodity at time “0”).
$q_1 / q_0$: a quantity relative.

4. Symbols employed in the measurement of relationship.

X: an observed value of a variable quantity.
Y: an observed value of a variable quantity. (The observed
    values of different variables may be represented also by
    the symbols $X_1$, $X_2$, $X_3$ or $W_1$, $W_2$, $W_3$.)
$\bar{X}$: the arithmetic mean of a number of observed values of
    the variable X. A similar symbol may be employed for
    other variables. (In one demonstration in the preceding
    pages, relating to multiple correlation, the symbols $A_1$,
    $A_2$, $A_3$ ... are used to represent the arithmetic means
    of the variables $X_1$, $X_2$, $X_3$ .... The symbols $M_x$
    and $M_y$ are occasionally employed to designate the
    arithmetic means of different variables.)
x: value of a variable quantity expressed as a deviation
    from the arithmetic mean of all the observed values.
    The symbol y and the symbols $x_1$, $x_2$, $x_3$ ... are
    similarly employed with respect to variables
    represented, as to original observations, by the symbols
    Y, $X_1$, $X_2$, $X_3$ ....
X': a value of a variable quantity expressed as a deviation,
    in class-interval units, from an arbitrary origin. The
    symbol Y' has a similar meaning.

X^\* a value of a variable quantity expressed as a deviation, 
in original units, from an arbitrary origin. The sym- 
bol has a similar meaning. 

$Y_c$: the computed or estimated value of a variable, as
    determined from an equation of average relationship;
    the symbol $y_c$ may be employed for such a computed
    value, expressed as a deviation from the mean.
p: the mean product of two variables when expressed as
    deviations from their respective arithmetic means, i.e.,
    $p = \Sigma xy / N$. When written with subscripts, as $p_{12}$, the
    latter relate to the variables in question, as $x_1$, $x_2$.
p': the mean product of two variables when expressed as
    deviations from assumed arithmetic means.
r: the coefficient of correlation. When written with
    subscripts, the latter indicate the variables to which the
    coefficient relates. Thus $r_{yx}$ refers to the
    variables y and x, and $r_{12}$ refers to the variables $x_1$
    and $x_2$.
$\rho$ (rho): a general index of correlation. Subscripts should be
    employed to indicate the variables to which the
    measure relates, as $\rho_{yx}$, $\rho_{xy}$, $\rho_{\log y \cdot x}$,
    $\rho_{\log y \cdot \log x}$, etc. (In each case the first subscript
    relates to the dependent variable.)
$\bar{\rho}$: a corrected index of correlation.
d: the deviation of a given observation from a fitted curve;
    the difference between an observed and a corresponding
    computed value of a variable.
v: a residual; identical in meaning with d, as given above.
S: the root-mean-square deviation about a fitted curve;
    the standard error of estimate. This measure should
    be written with a subscript to indicate the variable to
    which it applies, as $S_y$, $S_x$, $S_{\log y}$ (the standard error of
    estimate in terms of logarithms), $S_r$ (the standard error
    of estimate in terms of ratios), $S_{1/y}$ (the standard error
    of estimate in terms of reciprocals).
$\bar{S}$: a corrected standard error of estimate.
$\rho_r$: the coefficient of rank correlation.
$\eta$ (eta): the correlation ratio. Subscripts should be employed
    to represent the variables to which the measure
    relates, as $\eta_{yx}$ or $\eta_{xy}$. The first subscript in each case
    relates to the dependent variable.

$\bar{\eta}$: a corrected correlation ratio.
$\sigma_{ay}$: the root-mean-square deviation about a line through the
    means of the columns of a correlation table; the
    standard deviation of the y-arrays about their respective
    means. The symbol $\sigma_{ax}$ has the same meaning with
    respect to the rows of a correlation table, or the
    x-arrays.
$\sigma_{my}$: the standard deviation of the means of the columns of
    a correlation table about the mean of all the y's, the
    mean of each column being weighted by the number of
    items in that column. The symbol $\sigma_{mx}$ has the same
    meaning with respect to the means of the rows.
$\zeta$ (zeta): the test for linearity of regression ($\zeta = \eta^2 - r^2$).
m: the number of arrays employed in the computation of a
    given correlation ratio; also, the number of constants
    in the equation defining a curvilinear or multiple
    regression.
b: the coefficient of regression; the slope of a line of
    regression. When written with subscripts, the latter relate
    to the variables in question, as $b_{yx}$, $b_{12}$ (for the variables
    $x_1$, $x_2$). The first subscript relates to the dependent
    variable in each case; $b_{yx}$ is the coefficient of regression
    of y on x and $b_{xy}$ is the coefficient of regression of x on y.
z: a logarithmic transformation of the coefficient of
    correlation. $z = \tfrac{1}{2}\{\log_e (1 + r) - \log_e (1 - r)\}$.
$R_{1.234}$: the coefficient of multiple correlation between a
    dependent variable, $x_1$, and a combination of independent
    variables, $x_2$, $x_3$, and $x_4$. The order may be changed,
    but the primary subscript always relates to the
    dependent variable.
$\bar{R}_{1.234}$: a corrected coefficient of multiple correlation.
$r_{12.34}$: the coefficient of partial or net correlation between the
    variables $x_1$ and $x_2$ when the variables $x_3$ and $x_4$ are
    held constant. The order of subscripts is changed for a
    different combination of variables, the two primary
    subscripts always relating to the variables between
    which the net correlation is being measured.
$b_{12.34}$: the coefficient of net regression between the variables
    $x_1$ and $x_2$, the former being dependent, when the
    variables $x_3$ and $x_4$ are also taken account of in the
    estimating equation; the weight given to $x_2$ in estimating $x_1$,
    when the estimate is also based upon values of $x_3$ and
    $x_4$. The order of subscripts is changed for a different
    combination of variables.
$S_{1.234}$: the root-mean-square deviation about a line describing
    the relationship between a dependent variable, $x_1$, and
    a series of independent variables, $x_2$, $x_3$, and $x_4$; the
    standard error of estimate of $x_1$ under these conditions.
$\sigma_{1.234}$: the standard deviation of the fourth order; identical
    with $S_{1.234}$.
$\beta_{12.34}$: a coefficient of partial regression in an equation relating
    to variables expressed in standard deviation units.

(In the seven measures immediately above, the number of subscripts
corresponds to the number of variables included in a given study. For the sake of
simplicity, only four variables have been assumed.)
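
The logarithmic transformation z of the coefficient of correlation, defined in the list above, is simple to compute. The following is a minimal sketch of that definition; the function name `fisher_z` is an assumption made for illustration.

```python
import math

def fisher_z(r):
    """z = (1/2) * [log_e(1 + r) - log_e(1 - r)], defined for -1 < r < 1."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 0.5 * (math.log(1 + r) - math.log(1 - r))

print(round(fisher_z(0.5), 4))
```

The same quantity is the inverse hyperbolic tangent of r, so `math.atanh(r)` may be used as a check.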

5. Symbols employed in the measurement of errors.

$\sigma$: the standard deviation of a parent population.
$\sigma_m$: the standard error of a mean, derived from a knowledge
    of the $\sigma$ of the population.
s: the standard deviation of a sample.
$s_m$: the estimated standard error of a mean, in the
    derivation of which s is used as an approximation to $\sigma$.
T: the deviation of a given statistical measurement from
    the mean of a normal distribution, expressed in units
    of the standard deviation of that distribution; a normal
    deviate.
t: the deviation of a given statistical measurement from a
    hypothetical value, expressed in units of the estimated
    standard error of the measurement in question.
$\sigma_{ms}$: the standard error of the mean of a stratified sample.
D: a difference between two means.
$\sigma_D$: the standard error of the difference between two means.
$D_p$: a difference between two percentages.
$\sigma_{D_p}$: the standard error of the difference between two
    percentages.
$D_z$: the difference between two logarithmic transformations
    of the coefficient of correlation.
$\sigma_{D_z}$: the standard error of $D_z$.
$\sigma$, with any subscript, is used to represent the standard
    error of the measure to which the subscript relates.
    P.E. with any subscript is used to represent the
    probable error of the measure to which the subscript
    relates (P.E. = .67449$\sigma$).

the standard error of the difference between two coeffi- 
cients of regression. 

6. Symbols employed in the analysis of variance.

z: the difference between the natural logarithms of two
    standard deviations.
$\sigma_z$: the standard error of z.
$n_1$: the number of degrees of freedom in the larger of two
    variances being compared.
$n_2$: the number of degrees of freedom in the smaller of two
    variances being compared.

7. Other symbols.

p: the probability of a successful outcome of a given
    event.
q: the probability of an unsuccessful outcome of a given
    event.
n: the number of independent events in a given trial.
$\chi^2$: a quantity used in testing hypotheses involving the
    computation of theoretical frequencies; $\chi^2$ defines the
    relative magnitude of the differences between observed
    and theoretical frequencies.


Greek Alphabet

Letters   Names        Letters   Names        Letters   Names
Α α       Alpha        Ι ι       Iota         Ρ ρ       Rho
Β β       Beta         Κ κ       Kappa        Σ σ       Sigma
Γ γ       Gamma        Λ λ       Lambda       Τ τ       Tau
Δ δ       Delta        Μ μ       Mu           Υ υ       Upsilon
Ε ε       Epsilon      Ν ν       Nu           Φ φ       Phi
Ζ ζ       Zeta         Ξ ξ       Xi           Χ χ       Chi
Η η       Eta          Ο ο       Omicron      Ψ ψ       Psi
Θ θ       Theta        Π π       Pi           Ω ω       Omega



Appendix Table I 


Areas of the Normal Curve of Error in Terms of Abscissa 


X 

ff 

M 

.01 

.02 

.03 

.04 

.05 

.06 

.07 .08 

.09 

0.0 

.OWOO 

.00399 

.00798 

.01197 

.01595 

.01994 

.02392 

.02790 .03188 

. 03586 

0.1 

.03983 

.04380 

.04776 

.05172 

.05567 

.05962 

.06356 

.06749 . 07142 

. 07535 

0.2 

.07926 

.08317 

.08706 

.09095 

.09483 

.09871 

.10257 

.10642 .11026 

.11409 

0.3 

.11791 

. 12172 

. 12552 

.12930 

. 13307 

.13683 

.14058 

.14431 .14803 

. 15173 

0.4 

.15542 

.15910 

. 16276 

.16640 

.17003 

.17364 

.17724 

.18082 .18439 

. 18793 

0.5 

.19146 

. 19497 

. 19847 

.20194 

.20540 

.20884 

.21226 

.21566 .21904 

.22240 

0.6 

.22575 

.22907 

.23237 

.23565 

.23891 

.24215 

.24537 

.24857 .25175 

.25490 

0.7 

.25804 

.26115 

.26424 

.26730 

.27035 

.27337 

.27637 

.27935 .28230 

.28524 

0.8 

.28814 

.29103 

.29389 

.29673 

.29955 

.30234 

.30511 

.30785 .31057 

.31327 

0.9 

.31594 

.31869 

.32121 

.32381 

.32639 

.32894 

.33147 

.33398 .33646 

-33891 

1.0 

.34134 

.34376 

.34614 

.34850 

.35083 

.35314 

.36543 

.35769 .35993 

.36214 

Li 

.36433 

.36650 

.36864 

.37076 

,37286 

.37493 

.37698 

.37900 . 38100 

.38298 

1.2 

.38493 

.38686 

.38877 

.39065 

.39251 

.39435 

.39617 

.39796 .39973 

.40147 

1.3  .40320  .40490  .40658  .40824  .40988  .41149  .41309  .41466  .41621  .41774
1.4  .41924  .42073  .42220  .42364  .42507  .42647  .42786  .42922  .43056  .43189
1.5  .43319  .43448  .43574  .43699  .43822  .43943  .44062  .44179  .44295  .44408
1.6  .44520  .44630  .44738  .44845  .44950  .45053  .45154  .45254  .45352  .45449
1.7  .45543  .45637  .45728  .45818  .45907  .45994  .46080  .46164  .46246  .46327
1.8  .46407  .46485  .46562  .46638  .46712  .46784  .46856  .46926  .46995  .47062
1.9  .47128  .47193  .47257  .47320  .47381  .47441  .47500  .47558  .47615  .47670
2.0  .47725  .47778  .47831  .47882  .47932  .47982  .48030  .48077  .48124  .48169
2.1  .48214  .48257  .48300  .48341  .48382  .48422  .48461  .48500  .48537  .48574
2.2  .48610  .48645  .48679  .48713  .48745  .48778  .48809  .48840  .48870  .48899
2.3  .48928  .48956  .48983  .49010  .49036  .49061  .49086  .49111  .49134  .49158
2.4  .49180  .49202  .49224  .49245  .49266  .49286  .49305  .49324  .49343  .49361
2.5  .49379  .49396  .49413  .49430  .49446  .49461  .49477  .49492  .49506  .49520
2.6  .49534  .49547  .49560  .49573  .49585  .49598  .49609  .49621  .49632  .49643
2.7  .49653  .49664  .49674  .49683  .49693  .49702  .49711  .49720  .49728  .49736
2.8  .49744  .49752  .49760  .49767  .49774  .49781  .49788  .49795  .49801  .49807
2.9  .49813  .49819  .49825  .49831  .49836  .49841  .49846  .49851  .49856  .49861
3.0  .49865  .49869  .49874  .49878  .49882  .49886  .49889  .49893  .49897  .49900
3.1  .49903  .49906  .49910  .49913  .49916  .49918  .49921  .49924  .49926  .49929
3.2  .49931  .49934  .49936  .49938  .49940  .49942  .49944  .49946  .49948  .49950
3.3  .49952  .49953  .49955  .49957  .49958  .49960  .49961  .49962  .49964  .49965
3.4  .49966  .49968  .49969  .49970  .49971  .49972  .49973  .49974  .49975  .49976
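
The entries above appear to be areas under the normal curve between the mean and a deviation of x/σ, with the row giving the first decimal and the column the second. Under that reading (an assumption, since the column headings fall outside this excerpt), any entry can be recomputed from the error function. The function name below is illustrative only.

```python
import math


def area_from_mean(x_over_sigma: float) -> float:
    """Area under the unit normal curve between the mean and x/sigma."""
    # Phi(x) - 0.5 equals 0.5 * erf(x / sqrt(2)) for the standard normal.
    return 0.5 * math.erf(x_over_sigma / math.sqrt(2.0))


if __name__ == "__main__":
    # Print a few entries in the same five-place form as the table.
    for x in (1.30, 1.96, 2.58, 3.00):
        print(f"{x:.2f}  {area_from_mean(x):.5f}")
```

Entries recomputed this way can differ from the printed table by a unit in the last place, since the printed values come from older tabulations.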
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Table of t 


 n     P = .05       .02        .01

 1     12.706     31.821     63.657
 2      4.303      6.965      9.925
 3      3.182      4.541      5.841
 4      2.776      3.747      4.604
 5      2.571      3.365      4.032
 6      2.447      3.143      3.707
 7      2.365      2.998      3.499
 8      2.306      2.896      3.355
 9      2.262      2.821      3.250
10      2.228      2.764      3.169
11      2.201      2.718      3.106
12      2.179      2.681      3.055
13      2.160      2.650      3.012
14      2.145      2.624      2.977
15      2.131      2.602      2.947
16      2.120      2.583      2.921
17      2.110      2.567      2.898
18      2.101      2.552      2.878
19      2.093      2.539      2.861
20      2.086      2.528      2.845
21      2.080      2.518      2.831
22      2.074      2.508      2.819
23      2.069      2.500      2.807
24      2.064      2.492      2.797
25      2.060      2.485      2.787
26      2.056      2.479      2.779
27      2.052      2.473      2.771
28      2.048      2.467      2.763
29      2.045      2.462      2.756
30      2.042      2.457      2.750
 ∞      1.95996    2.32634    2.57582


¹ Excerpts from Table IV, R. A. Fisher, Statistical Methods for Research
Workers. These excerpts are printed here through the courtesy of Dr. Fisher
and his publishers, Oliver and Boyd, of Edinburgh.
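
As a rough check on how the table of t is read, the sketch below integrates the density of Student's distribution numerically and returns the probability P that t is exceeded in absolute value. It assumes that n denotes degrees of freedom and that P is a two-tailed probability; the function names and the Simpson's-rule step count are illustrative choices, not part of the source.

```python
import math


def t_density(x: float, n: int) -> float:
    """Density of Student's t with n degrees of freedom."""
    c = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return c * (1 + x * x / n) ** (-(n + 1) / 2)


def two_tailed_p(t: float, n: int, steps: int = 2000) -> float:
    """P(|T| > t), using Simpson's rule on [0, t]; steps must be even."""
    h = t / steps
    total = t_density(0.0, n) + t_density(t, n)
    for i in range(1, steps):
        # Simpson weights: 4 on odd nodes, 2 on even interior nodes.
        total += (4 if i % 2 else 2) * t_density(i * h, n)
    area = total * h / 3  # area between 0 and t
    return 1 - 2 * area


if __name__ == "__main__":
    # Tabled value for n = 10, P = .05 is 2.228.
    print(round(two_tailed_p(2.228, 10), 4))
```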
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Values of the Correlation Coefficient for Different Levels of 

Significance 


  n     P = .05       .02         .01

  1     .996917    .9995066    .9998766
  2     .95000     .98000      .990000
  3     .8783      .93433      .95873
  4     .8114      .8822       .91720
  5     .7545      .8329       .8745
  6     .7067      .7887       .8343
  7     .6664      .7498       .7977
  8     .6319      .7155       .7646
  9     .6021      .6851       .7348
 10     .5760      .6581       .7079
 11     .5529      .6339       .6835
 12     .5324      .6120       .6614
 13     .5139      .5923       .6411
 14     .4973      .5742       .6226
 15     .4821      .5577       .6055
 16     .4683      .5425       .5897
 17     .4555      .5285       .5751
 18     .4438      .5155       .5614
 19     .4329      .5034       .5487
 20     .4227      .4921       .5368
 25     .3809      .4451       .4869
 30     .3494      .4093       .4487
 35     .3246      .3810       .4182
 40     .3044      .3578       .3932
 45     .2875      .3384       .3721
 50     .2732      .3218       .3541
 60     .2500      .2948       .3248
 70     .2319      .2737       .3017
 80     .2172      .2565       .2830
 90     .2050      .2422       .2673
100     .1946      .2301       .2540

For a total correlation, n is 2 less than the number of pairs in the
sample; for a partial correlation, the number of eliminated variates also
should be subtracted.

¹ Excerpts from Table V-A, R. A. Fisher, Statistical Methods for Research
Workers. These excerpts are printed here through the courtesy of Dr. Fisher
and his publishers, Oliver and Boyd, of Edinburgh.
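
The significance levels of r are tied to those of t by the standard identity r = t / √(t² + n), where n is the same quantity that indexes both tables. The source does not state this identity; it is supplied here as a well-known relation, and the function name is illustrative. A minimal sketch that rebuilds a few correlation entries from the corresponding values of t:

```python
import math


def r_from_t(t: float, n: int) -> float:
    """Critical correlation coefficient implied by a critical t for the same n."""
    return t / math.sqrt(t * t + n)


if __name__ == "__main__":
    # (n, tabled t at P = .05) pairs copied from the table of t.
    for n, t in ((5, 2.571), (10, 2.228), (20, 2.086)):
        print(n, round(r_from_t(t, n), 4))
```

For n = 10 and P = .05, the tabled t of 2.228 gives r of about .5760, which agrees with the entry in the correlation table.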
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Appendix Table IV 


Showing the Relations between r and z for Values of z from 0 to 5 ¹


 z     .00     .01     .02     .03     .04     .05     .06     .07     .08     .09

 .0   .0000   .0100   .0200   .0300   .0400   .0500   .0599   .0699   .0798   .0898
 .1   .0997   .1096   .1194   .1293   .1391   .1489   .1587   .1684   .1781   .1878
 .2   .1974   .2070   .2165   .2260   .2355   .2449   .2543   .2636   .2729   .2821
 .3   .2913   .3004   .3095   .3185   .3275   .3364   .3452   .3540   .3627   .3714
 .4   .3800   .3885   .3969   .4053   .4136   .4219   .4301   .4382   .4462   .4542
 .5   .4621   .4700   .4777   .4854   .4930   .5005   .5080   .5154   .5227   .5299
 .6   .5370   .5441   .5511   .5581   .5649   .5717   .5784   .5850   .5915   .5980
 .7   .6044   .6107   .6169   .6231   .6291   .6352   .6411   .6469   .6527   .6584
 .8   .6640   .6696   .6751   .6805   .6858   .6911   .6963   .7014   .7064   .7114
 .9   .7163   .7211   .7259   .7306   .7352   .7398   .7443   .7487   .7531   .7574
1.0   .7616   .7658   .7699   .7739   .7779   .7818   .7857   .7895   .7932   .7969
1.1   .8005   .8041   .8076   .8110   .8144   .8178   .8210   .8243   .8275   .8306
1.2   .8337   .8367   .8397   .8426   .8455   .8483   .8511   .8538   .8565   .8591
1.3   .8617   .8643   .8668   .8693   .8717   .8741   .8764   .8787   .8810   .8832
1.4   .8854   .8875   .8896   .8917   .8937   .8957   .8977   .8996   .9015   .9033
1.5   .9052   .9069   .9087   .9104   .9121   .9138   .9154   .9170   .9186   .9202
1.6   .9217   .9232   .9246   .9261   .9275   .9289   .9302   .9316   .9329   .9342
1.7   .9354   .9367   .9379   .9391   .9402   .9414   .9425   .9436   .9447   .9458
1.8   .9468   .9478   .9488   .9498   .9508   .9518   .9527   .9536   .9545   .9554
1.9   .9562   .9571   .9579   .9587   .9595   .9603   .9611   .9619   .9626   .9633
2.0   .9640   .9647   .9654   .9661   .9668   .9674   .9680   .9687   .9693   .9699
2.1   .9705   .9710   .9716   .9722   .9727   .9732   .9738   .9743   .9748   .9753
2.2   .9757   .9762   .9767   .9771   .9776   .9780   .9785   .9789   .9793   .9797
2.3   .9801   .9805   .9809   .9812   .9816   .9820   .9823   .9827   .9830   .9834
2.4   .9837   .9840   .9843   .9846   .9849   .9852   .9855   .9858   .9861   .9863
2.5   .9866   .9869   .9871   .9874   .9876   .9879   .9881   .9884   .9886   .9888
2.6   .9890   .9892   .9895   .9897   .9899   .9901   .9903   .9905   .9906   .9908
2.7   .9910   .9912   .9914   .9915   .9917   .9919   .9920   .9922   .9923   .9925
2.8   .9926   .9928   .9929   .9931   .9932   .9933   .9935   .9936   .9937   .9938
2.9   .9940   .9941   .9942   .9943   .9944   .9945   .9946   .9947   .9949   .9950

3.0   .9951
4.0   .9993
5.0   .9999


¹ The figures in the body of the table are values of r corresponding to
z-values read from the scales on the left and top of the table.
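
The relation tabulated here is Fisher's transformation, z = ½ log_e[(1 + r)/(1 − r)], so that r is the hyperbolic tangent of z. A minimal sketch of both directions follows; the function names are illustrative and not taken from the source.

```python
import math


def r_from_z(z: float) -> float:
    """Correlation coefficient r corresponding to a given z: r = tanh(z)."""
    return math.tanh(z)


def z_from_r(r: float) -> float:
    """Inverse relation: z = 0.5 * ln((1 + r) / (1 - r)), for |r| < 1."""
    return 0.5 * math.log((1 + r) / (1 - r))


if __name__ == "__main__":
    # Rebuild the z = 1.0 row of the table, columns .00 through .09.
    print([round(r_from_z(1.0 + k / 100), 4) for k in range(10)])
```

Values recomputed in this way may differ from the printed table by a unit in the fourth place.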
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Appendix Table Y ^ 


Table of χ²


 n    P = .99      .95       .50       .10       .05       .02       .01

 1    .000157    .00393      .455     2.706     3.841     5.412     6.635
 2    .0201      .103       1.386     4.605     5.991     7.824     9.210
 3    .115       .352       2.366     6.251     7.815     9.837    11.341
 4    .297       .711       3.357     7.779     9.488    11.668    13.277
 5    .554      1.145       4.351     9.236    11.070    13.388    15.086
 6    .872      1.635       5.348    10.645    12.592    15.033    16.812
 7   1.239      2.167       6.346    12.017    14.067    16.622    18.475
 8   1.646      2.733       7.344    13.362    15.507    18.168    20.090
 9   2.088      3.325       8.343    14.684    16.919    19.679    21.666
10   2.558      3.940       9.342    15.987    18.307    21.161    23.209
11   3.053      4.575      10.341    17.275    19.675    22.618    24.725
12   3.571      5.226      11.340    18.549    21.026    24.054    26.217
13   4.107      5.892      12.340    19.812    22.362    25.472    27.688
14   4.660      6.571      13.339    21.064    23.685    26.873    29.141
15   5.229      7.261      14.339    22.307    24.996    28.259    30.578
16   5.812      7.962      15.338    23.542    26.296    29.633    32.000
17   6.408      8.672      16.338    24.769    27.587    30.995    33.409
18   7.015      9.390      17.338    25.989    28.869    32.346    34.805
19   7.633     10.117      18.338    27.204    30.144    33.687    36.191
20   8.260     10.851      19.337    28.412    31.410    35.020    37.566
21   8.897     11.591      20.337    29.615    32.671    36.343    38.932
22   9.542     12.338      21.337    30.813    33.924    37.659    40.289
23  10.196     13.091      22.337    32.007    35.172    38.968    41.638
24  10.856     13.848      23.337    33.196    36.415    40.270    42.980
25  11.524     14.611      24.337    34.382    37.652    41.566    44.314
26  12.198     15.379      25.336    35.563    38.885    42.856    45.642
27  12.879     16.151      26.336    36.741    40.113    44.140    46.963
28  13.565     16.928      27.336    37.916    41.337    45.419    48.278
29  14.256     17.708      28.336    39.087    42.557    46.693    49.588
30  14.953     18.493      29.336    40.256    43.773    47.962    50.892

For larger values of n, the expression $\sqrt{2\chi^2} - \sqrt{2n - 1}$ may
be used as a normal deviate with unit standard error.

¹ Excerpts from Table III, R. A. Fisher, Statistical Methods for Research
Workers. These excerpts are printed here through the courtesy of Dr. Fisher
and his publishers, Oliver and Boyd, of Edinburgh.
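
The note to this table says that, for larger n, the quantity √(2χ²) − √(2n − 1) may be treated as a normal deviate with unit standard error. The sketch below applies that rule in both directions. The function names are illustrative, and the one-tailed normal deviate 1.645 for P = .05 is a standard value supplied here, not taken from this table.

```python
import math


def normal_deviate(chi2: float, n: int) -> float:
    """Approximate normal deviate sqrt(2*chi2) - sqrt(2n - 1) for large n."""
    return math.sqrt(2.0 * chi2) - math.sqrt(2.0 * n - 1.0)


def chi2_approx(deviate: float, n: int) -> float:
    """Invert the rule: chi2 is roughly 0.5 * (deviate + sqrt(2n - 1))**2."""
    return 0.5 * (deviate + math.sqrt(2.0 * n - 1.0)) ** 2


if __name__ == "__main__":
    # Tabled chi-square for n = 30, P = .05 is 43.773.
    print(round(normal_deviate(43.773, 30), 3))
    print(round(chi2_approx(1.645, 30), 3))
```

At n = 30 the approximation is still somewhat rough, which is consistent with the table itself stopping at that point.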
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Appendix Table VI ^ 

1 Per Cent Points of the Distribution of z


I 4.1530 I 4.2585 
I 2,2950 i 2.2976 
' 1.7640 1.7140 
1.5270 1.4452 
i 1.3943 1.2029 


6 

1.3103 

7 

1 . 2526 

8 

1.2106 

9 

1.1786 

JO 

1 . 1535 

11 

1.1333 

12 

1.1166 

13 

1 . 1027 

14 

1 . 0909 

15 

1 . 0807 

16 

1 . 0719 

17 

1.0641 

IS 

1.0572 

19 

1 .0511 

20 

1 . 0457 

21 

1 .0408 

22 

1.0363 

23 

1 . 0322 

24 

1.0285 

25 

1.0251 

26 

1.0220 

27 

1.0191 

28 

1.0164 

29 

1.0139 

30 

1.0116 

60 

. 9784 

00 1 

. 9462 


Values of n1


‘A. 

I &. 

! 4.3175 

4.3297 

2.2988 

2.2991 

1.67S6 

1.6703 

1 . 8856 

1.3711 

1.2164 

1 . 1974 

1 . 1068 

1.0S43 

1.0300 

1.0048 

.9734 

.9459 

. 9299 

.9006 

.8954 

.8646 

.8674 

.8354 

.8443 

.Sill 

. 824s 

.7907 

.8082 

.7732 

. 7939 

. 7582 

.7814 

.7450 

. 7705 

.7335 

. 7607 

. 7232 

.7521 

.7140 

. 7443 

.7058 1 

. 7372 

. 6984 ! 

. 7309 

.6916 I 

. 7251 

. 6855 

.7197 

.6799 

.7148 

.6747 

.7103 

.6699 

,7062 

.8655 

. 7023 

.6614 

,6987 

.6576 

.6954 

.6540 

.6472 

.6028 

. 5999 

.5522 


4.3482 
2.2994 
1.6569 
1 . 3173 
1 . 1644 
1.0460 
.9614 
.8983 
.8494 
.8104 


4.3585 I 4.3689 
2.2997 1 2.2999 
1.6489 I 1.6404 
1.3327 1.3170 
1 . 1457 1 . 1239 


■ 4604 I .39081 .2913 0 




This tai e is nrintprf btahsticxil Methods for Research Workers. 
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Appendix Table VII ^ 

5 Per Cent Points of the Distribution of z


                                  Values of n1

n2      1       2       3       4       5       6       8      12      24       ∞

 1   2.5421  2.6479  2.6870  2.7071  2.7194  2.7276  2.7380  2.7484  2.7588  2.7693
 2   1.4592  1.4722  1.4765  1.4787  1.4800  1.4808  1.4819  1.4830  1.4840  1.4851
 3   1.1577  1.1284  1.1137  1.1051  1.0994  1.0953  1.0899  1.0842  1.0781  1.0716
 4   1.0212   .9690   .9429   .9272   .9168   .9093   .8993   .8885   .8767   .8639
 5    .9441   .8777   .8441   .8236   .8097   .7997   .7862   .7714   .7550   .7368
 6    .8948   .8188   .7798   .7558   .7394   .7274   .7112   .6931   .6729   .6499
 7    .8606   .7777   .7347   .7080   .6896   .6761   .6576   .6369   .6134   .5862
 8    .8355   .7475   .7014   .6725   .6525   .6378   .6175   .5945   .5682   .5371
 9    .8163   .7242   .6757   .6450   .6238   .6080   .5862   .5613   .5324   .4979
10    .8012   .7058   .6553   .6232   .6009   .5843   .5611   .5346   .5035   .4657
11    .7889   .6909   .6387   .6055   .5822   .5648   .5406   .5126   .4795   .4387
12    .7788   .6786   .6250   .5907   .5666   .5487   .5234   .4941   .4592   .4156
13    .7703   .6682   .6134   .5783   .5535   .5350   .5089   .4785   .4419   .3957
14    .7630   .6594   .6036   .5677   .5423   .5233   .4964   .4649   .4269   .3782
15    .7568   .6518   .5950   .5585   .5326   .5131   .4855   .4532   .4138   .3628
16    .7514   .6451   .5876   .5505   .5241   .5042   .4760   .4428   .4022   .3490
17    .7466   .6393   .5811   .5434   .5166   .4964   .4676   .4337   .3919   .3366
18    .7424   .6341   .5753   .5371   .5099   .4894   .4602   .4255   .3827   .3253
19    .7386   .6295   .5701   .5315   .5040   .4832   .4535   .4182   .3743   .3151
20    .7352   .6254   .5654   .5265   .4986   .4776   .4474   .4116   .3668   .3057
21    .7322   .6216   .5612   .5219   .4938   .4725   .4420   .4055   .3599   .2971
22    .7294   .6182   .5574   .5178   .4894   .4679   .4370   .4001   .3536   .2892
23    .7269   .6151   .5540   .5140   .4854   .4636   .4325   .3950   .3478   .2818
24    .7246   .6123   .5508   .5106   .4817   .4598   .4283   .3904   .3425   .2749
25    .7225   .6097   .5478   .5074   .4783   .4562   .4244   .3862   .3376   .2685
26    .7205   .6073   .5451   .5045   .4752   .4529   .4209   .3823   .3330   .2625
27    .7187   .6051   .5427   .5017   .4723   .4499   .4176   .3786   .3287   .2569
28    .7171   .6030   .5403   .4992   .4696   .4471   .4140   .3752   .3248   .2516
29    .7155   .6011   .5382   .4969   .4671   .4444   .4117   .3720   .3211   .2466
30    .7141   .5994   .5362   .4947   .4648   .4420   .4090   .3691   .3176   .2419
60    .6933   .5738   .5073   .4632   .4311   .4064   .3702   .3255   .2654   .1644
 ∞    .6729   .5486   .4787   .4319   .3974   .3706   .3309   .2804   .2085   0



^ From Table VI, R. A. Fisher, Statistical Methods for Research Workers. 
This table is printed here through the courtesy of Dr. Fisher and his pub- 
lishers, Oliver and Boyd, of Edinburgh. 
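
Fisher's z in these tables is one half the natural logarithm of a ratio of two variance estimates, so a tabled z converts to the variance ratio F by F = e^(2z). That definition is the standard one and is assumed here rather than quoted from this page; the function names are illustrative. A minimal sketch:

```python
import math


def f_from_z(z: float) -> float:
    """Variance ratio F corresponding to Fisher's z: F = exp(2z)."""
    return math.exp(2.0 * z)


def z_from_f(f: float) -> float:
    """Fisher's z corresponding to a variance ratio F: z = 0.5 * ln(F)."""
    return 0.5 * math.log(f)


if __name__ == "__main__":
    # 5 per cent point of z for n1 = 1, n2 = 10 is .8012 in the table.
    print(round(f_from_z(0.8012), 3))
```

For n1 = 1 the resulting F is the square of the corresponding 5 per cent value of t; with n2 = 10, e^(2 × .8012) is about 4.965, and 2.228² is about 4.964.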






Appendix Table VIII 


Squares of the Natural Numbers from 100 to 999


                                      SQUARE OF
  n       n      n+1     n+2     n+3     n+4     n+5     n+6     n+7     n+8     n+9

100    10000   10201   10404   10609   10816   11025   11236   11449   11664   11881
110    12100   12321   12544   12769   12996   13225   13456   13689   13924   14161
120    14400   14641   14884   15129   15376   15625   15876   16129   16384   16641
130    16900   17161   17424   17689   17956   18225   18496   18769   19044   19321
140    19600   19881   20164   20449   20736   21025   21316   21609   21904   22201
150    22500   22801   23104   23409   23716   24025   24336   24649   24964   25281
160    25600   25921   26244   26569   26896   27225   27556   27889   28224   28561
170    28900   29241   29584   29929   30276   30625   30976   31329   31684   32041
180    32400   32761   33124   33489   33856   34225   34596   34969   35344   35721
190    36100   36481   36864   37249   37636   38025   38416   38809   39204   39601
200    40000   40401   40804   41209   41616   42025   42436   42849   43264   43681
210    44100   44521   44944   45369   45796   46225   46656   47089   47524   47961
220    48400   48841   49284   49729   50176   50625   51076   51529   51984   52441
230    52900   53361   53824   54289   54756   55225   55696   56169   56644   57121
240    57600   58081   58564   59049   59536   60025   60516   61009   61504   62001
250    62500   63001   63504   64009   64516   65025   65536   66049   66564   67081
260    67600   68121   68644   69169   69696   70225   70756   71289   71824   72361
270    72900   73441   73984   74529   75076   75625   76176   76729   77284   77841
280    78400   78961   79524   80089   80656   81225   81796   82369   82944   83521
290    84100   84681   85264   85849   86436   87025   87616   88209   88804   89401
300    90000   90601   91204   91809   92416   93025   93636   94249   94864   95481
310    96100   96721   97344   97969   98596   99225   99856  100489  101124  101761
320   102400  103041  103684  104329  104976  105625  106276  106929  107584  108241
330   108900  109561  110224  110889  111556  112225  112896  113569  114244  114921
340   115600  116281  116964  117649  118336  119025  119716  120409  121104  121801
350   122500  123201  123904  124609  125316  126025  126736  127449  128164  128881
360   129600  130321  131044  131769  132496  133225  133956  134689  135424  136161
370   136900  137641  138384  139129  139876  140625  141376  142129  142884  143641
380   144400  145161  145924  146689  147456  148225  148996  149769  150544  151321
390   152100  152881  153664  154449  155236  156025  156816  157609  158404  159201
400   160000  160801  161604  162409  163216  164025  164836  165649  166464  167281
410   168100  168921  169744  170569  171396  172225  173056  173889  174724  175561
420   176400  177241  178084  178929  179776  180625  181476  182329  183184  184041
430   184900  185761  186624  187489  188356  189225  190096  190969  191844  192721
440   193600  194481  195364  196249  197136  198025  198916  199809  200704  201601
450   202500  203401  204304  205209  206116  207025  207936  208849  209764  210681
460   211600  212521  213444  214369  215296  216225  217156  218089  219024  219961
470   220900  221841  222784  223729  224676  225625  226576  227529  228484  229441
480   230400  231361  232324  233289  234256  235225  236196  237169  238144  239121
490   240100  241081  242064  243049  244036  245025  246016  247009  248004  249001
500   250000  251001  252004  253009  254016  255025  256036  257049  258064  259081
510   260100  261121  262144  263169  264196  265225  266256  267289  268324  269361
520   270400  271441  272484  273529  274576  275625  276676  277729  278784  279841
530   280900  281961  283024  284089  285156  286225  287296  288369  289444  290521
540   291600  292681  293764  294849  295936  297025  298116  299209  300304  301401
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Appendix Table Ylll— Continued 


Squares of the Natural Numbers from 100 to 999 


                                      SQUARE OF
  n       n      n+1     n+2     n+3     n+4     n+5     n+6     n+7     n+8     n+9

550   302500  303601  304704  305809  306916  308025  309136  310249  311364  312481
560   313600  314721  315844  316969  318096  319225  320356  321489  322624  323761
570   324900  326041  327184  328329  329476  330625  331776  332929  334084  335241
580   336400  337561  338724  339889  341056  342225  343396  344569  345744  346921
590   348100  349281  350464  351649  352836  354025  355216  356409  357604  358801
600   360000  361201  362404  363609  364816  366025  367236  368449  369664  370881
610   372100  373321  374544  375769  376996  378225  379456  380689  381924  383161
620   384400  385641  386884  388129  389376  390625  391876  393129  394384  395641
630   396900  398161  399424  400689  401956  403225  404496  405769  407044  408321
640   409600  410881  412164  413449  414736  416025  417316  418609  419904  421201
650   422500  423801  425104  426409  427716  429025  430336  431649  432964  434281
660   435600  436921  438244  439569  440896  442225  443556  444889  446224  447561
670   448900  450241  451584  452929  454276  455625  456976  458329  459684  461041
680   462400  463761  465124  466489  467856  469225  470596  471969  473344  474721
690   476100  477481  478864  480249  481636  483025  484416  485809  487204  488601
700   490000  491401  492804  494209  495616  497025  498436  499849  501264  502681
710   504100  505521  506944  508369  509796  511225  512656  514089  515524  516961
720   518400  519841  521284  522729  524176  525625  527076  528529  529984  531441
730   532900  534361  535824  537289  538756  540225  541696  543169  544644  546121
740   547600  549081  550564  552049  553536  555025  556516  558009  559504  561001
750   562500  564001  565504  567009  568516  570025  571536  573049  574564  576081
760   577600  579121  580644  582169  583696  585225  586756  588289  589824  591361
770   592900  594441  595984  597529  599076  600625  602176  603729  605284  606841
780   608400  609961  611524  613089  614656  616225  617796  619369  620944  622521
790   624100  625681  627264  628849  630436  632025  633616  635209  636804  638401
800   640000  641601  643204  644809  646416  648025  649636  651249  652864  654481
810   656100  657721  659344  660969  662596  664225  665856  667489  669124  670761
820   672400  674041  675684  677329  678976  680625  682276  683929  685584  687241
830   688900  690561  692224  693889  695556  697225  698896  700569  702244  703921
840   705600  707281  708964  710649  712336  714025  715716  717409  719104  720801
850   722500  724201  725904  727609  729316  731025  732736  734449  736164  737881
860   739600  741321  743044  744769  746496  748225  749956  751689  753424  755161
870   756900  758641  760384  762129  763876  765625  767376  769129  770884  772641
880   774400  776161  777924  779689  781456  783225  784996  786769  788544  790321
890   792100  793881  795664  797449  799236  801025  802816  804609  806404  808201
900   810000  811801  813604  815409  817216  819025  820836  822649  824464  826281
910   828100  829921  831744  833569  835396  837225  839056  840889  842724  844561
920   846400  848241  850084  851929  853776  855625  857476  859329  861184  863041
930   864900  866761  868624  870489  872356  874225  876096  877969  879844  881721
940   883600  885481  887364  889249  891136  893025  894916  896809  898704  900601
950   902500  904401  906304  908209  910116  912025  913936  915849  917764  919681
960   921600  923521  925444  927369  929296  931225  933156  935089  937024  938961
970   940900  942841  944784  946729  948676  950625  952576  954529  956484  958441
980   960400  962361  964324  966289  968256  970225  972196  974169  976144  978121
990   980100  982081  984064  986049  988036  990025  992016  994009  996004  998001





Appendix Table IX 


Sums of the First Three Powers of the Natural Numbers from 1 to 50


 n    Σ(n)    Σ(n²)     Σ(n³)        n    Σ(n)    Σ(n²)      Σ(n³)

 1       1        1         1       26     351     6201     123201
 2       3        5         9       27     378     6930     142884
 3       6       14        36       28     406     7714     164836
 4      10       30       100       29     435     8555     189225
 5      15       55       225       30     465     9455     216225
 6      21       91       441       31     496    10416     246016
 7      28      140       784       32     528    11440     278784
 8      36      204      1296       33     561    12529     314721
 9      45      285      2025       34     595    13685     354025
10      55      385      3025       35     630    14910     396900
11      66      506      4356       36     666    16206     443556
12      78      650      6084       37     703    17575     494209
13      91      819      8281       38     741    19019     549081
14     105     1015     11025       39     780    20540     608400
15     120     1240     14400       40     820    22140     672400
16     136     1496     18496       41     861    23821     741321
17     153     1785     23409       42     903    25585     815409
18     171     2109     29241       43     946    27434     894916
19     190     2470     36100       44     990    29370     980100
20     210     2870     44100       45    1035    31395    1071225
21     231     3311     53361       46    1081    33511    1168561
22     253     3795     64009       47    1128    35720    1272384
23     276     4324     76176       48    1176    38024    1382976
24     300     4900     90000       49    1225    40425    1500625
25     325     5525    105625       50    1275    42925    1625625
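
The three columns of this table follow the familiar closed forms Σn = n(n + 1)/2, Σn² = n(n + 1)(2n + 1)/6, and Σn³ = [n(n + 1)/2]². These formulas are standard results rather than something stated on this page, and the function name below is illustrative. A short sketch that reproduces any row:

```python
def power_sums(n: int) -> tuple[int, int, int]:
    """Sums of the first, second, and third powers of 1..n (closed forms)."""
    s1 = n * (n + 1) // 2                # sum of n
    s2 = n * (n + 1) * (2 * n + 1) // 6  # sum of n squared
    s3 = s1 * s1                         # sum of n cubed equals (sum of n) squared
    return s1, s2, s3


if __name__ == "__main__":
    for n in (10, 25, 50):
        print(n, *power_sums(n))
```

The integer divisions are exact, since n(n + 1) is always even and n(n + 1)(2n + 1) is always divisible by 6.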





Appendix Table X 

Five-Place Logarithms of Numbers

100-150


N

L 0

1

2

3

4

5

6

7

8

9

Prop. Pts.

100 

00 000 

043 

087 

130 

173 

217 

260 

303 

346 

389 





101 

432 

475 

518 

561 

604 

647 

689 

732 

775 

817 





102 

860 

903 

945 

988 

*030 

*072 

*115 

*157 

*199 

*242 



43 


103 

01 284 

326 

368 

410 

452 

494 

536 

578 

620 

662 



42 

104 

703 

745 

787 

828 

870 

912 

953 

995 

*036 

*078 

1 

4.4 

4 3 

4.2 


02 119 









2 

8.8 

8 6 

8.4 

105 

160 


243 

284 

325 

366 

msm 

449 

490 

3 

13.2 

12.9 

12.6 

106 

531 

572 

612 

653 

694 

735 

776 

816 

857 

898 

4 

17.6 
22.0 
2P. 4 

17.2 

16.8 

107 

938 

979 


*060 

*100 

*141 

*181 

*222 

*262 

*302 

6 

21.5 

25.8 

30.1 

21.1) 

25.2 

2«.4. 

108 

03 342 

383 

423 

463 

503 

543 

583 

623 

663 

703 

7 

sols 

109 

743 

782 

822 

862 

902 

941 

981 

*021 

*060 

*100 

S 

35.2 

34 . 4133 , t ) i 









9 

39.6 

7 37 12 

110 

04 139 

179 

218 

258 

297 

336 

376 

415 

454 

493 




111 

532 

571 

610 

650 

689 

727 

766 

805 

844 

883 





112 

922 

961 

999 

*038 

*077 

*115 

*154 

*192 

*231 

*269 


41 

40 

39 

113 

05 308 

346 

385 

423 

461 


538 

576 

614 

652 



4.0 

8.0 

114 

690 

729 

767 

805 

843 

881 

918 

956 

994 

*032 

1 

2 

4.1 

8.2 

3.9 

7.8 

115 

06 070 

108 

145 

ISS 

221 

258 

296 

333 

371 


3 

4 

12.3 

16.4 

12.0 

16.0 

11.7 

15.6 

116 

446 

483 

521 

558 

595 

633 

670 

707 

744 

781 

5 

20.6 

20.0 

X9.5 

117 

819 

856 

893 

930 

967 

Eiligl 

*041 

*078 

*115 

*151 

6 

24.6 

24.0 

23.4 

118 

07 188 

225 

262 

298 

335 

372 

408 

445 

482 

518 

7 

28.7 

28.0 

27.3 

119 

656 

591 

628 

664 

m 

737 

773 

809 

846 

882 

8 

9 

32.8 

36.9 

32.0 

36.0 

31.2 

35.1 

120 

918 

954 

990 

*027 

*063 

*099 

*135 

*171 

£3^ 

*243 





121 

08 279 

314 

350 

386 

422 

458 

493 

529 

565 

600 



37 


122 

636 

672 

707 

743 

778 

814 

849 

884 


955 


38 

36 

123 

991 

*026 

*061 

*096 


ii^ 


*237 

*272 

*307 

1 

.8 8 

37 

3 6 

124 

09 342 

377 

412 

447 

482 

517 

552 

587 

621 

656 

2 

7.6 

7.4 

7.2 












3 

11.4 

11.1 

10.8 

125 

691 

726 

760 

795 

830 

864 

899 

934 

968 

msm 

4 

15.2 

14.8 

14.4 

126 

10 037 

072 

106 

140 

175 

209 

243 

278 

312 

346 

5 

19.0 

22.8 

26.6 

18.5 

22.2 

25.9 

18.0 

21.6 

25.2 

127 

380 

415 

449 

483 

517 

551 

685 

619 

653 

687 

7 

128 

721 

755 

789 

823 

857 

890 

924 

958 

992 

*025 

8 

30.4 

29.6 

28.S 

129 . 

11 059 

093 

126 

160 

193 

227 

261 

294 

327 

|61 

9 

34.2 

33.3 

32.4 

130 

394 

428 

461 

494 

00 

561 

594 

628 

661 

694 





131 

727 

760 

793 

826 

860 

893 

926 

959 

992 

*024 

] 

35 

34 

3E 

132 

12 057 

090 

123 

156 

189 

222 

254 

287 

320 

352 


3.3 

133 ' 

385 

418 

450 

483 

516 

548 

581 

613 

646 

678 

1 ■ 

3.5 

3.4 

134 

710 

743 

775 

808 

840 

872 

905 

937 

969 


2 

3 

7.0 

10.5 

6.8 

10.2 

6.6 

9.9 

135 

13 033 

066 

098 

130 

162 

194 

226 

258 

290 

322 

4 

6' 

14.0 

17.5 

13.6 

17,0 

13.2 

16.5 

'■ 136 : 

■ •354 

386 

418 

450 

481 

513 

645 

577 

609 

640 

6 

21.0 

20.4 

19.8 

,137 

. 672 

704 , 

735 

767 

799 

830 

862 

893 

925 

956 

7 

24.6 

23.8 

23.1 

• 138 ':^ 

988 

*019 

*051 

*082 

*114 

*143 

*176 


*239 

*270 

8 

9 

28.0 

31.6 

27.2 

30.6 

26.4 

29.7 

139

14 301

333

364

395

426 

457 

489 

520

551

582

140

613 

644 

675 

706

737 

768 

799 

829 

860

891 


32

31 


141

922 

953 

983 

*014 

*045 

*076 

*106 

*137 

*168 

*198 


30

142 

15 229 

259

290 

320 

351 

381 

412 

442 

473

503

1 

2 

3.2 

6.4 

3.1 

6.2 

3.0 

6.0 

143 

534 

564

594 

625 

655 

685 

716 

746 

776 

806 

144 

836 

866 

897 

927 

957 

987 

*017 

*047 

*077 

*107 

3 

9.6 

9.3

9.0 











4 

12.8 

12.4

12.0 

145 

16 137 

167 

197

227

256 

286 

316 

346 

376 

406

5

16.0

15.5

15.0

18.0

21.0

146 

435

465

495

524

554

584

613

643
673 

702 

6

7

19.2
22.4

18.6
21.7

147 

732

761 

791 

820 

850 

879 

909 

938 

967 

997 

8

25.6

24.8 

24.0 

148 

17 026 

056 

085 

114 

143 

173 

202 

231 

260 

289 

9

28.8 

27.9 

27.0 

149 

319 

348 

377

406 

435 

464 

493 

522 

551 

580 





150

609

638

667

696

725

754

782

811

840

869





N   L 0   1   2   3   4   5   6   7   8   9   Prop. Pts.






















Appendix Table X 




7 


$ 




811 

840 

869 

*099 

*127 

*156

384 

412 

441 

667 

696 

724 

949 

977 

*005 

229 

257 

285 

507 

535 

562 

783

811 

838 

*058 

*085 

*112 

330 

358 

385 

602 

629 

656 

871

898

925 

*139 

*165 

*192 

405 

431 

458 

669 

696 

722 

932 

958 

985 

194 

220 

246 

453 

479 

505 

712 

737 

763 

968 

994 

*019 

223 

249 

274 

477 

502 

528 

729 

754 

779 

980 

*005 

*030 

229 

254 

279 

477 

502 

527 

724 

748 

773 

969 

993 

*018 

212 

237 

261 

455 

479 

503 

696 

720 

744 

935 

959 

983

174 

198 

221 

411 

435 

458 

647 

670 

694 

881 

905 

928 

*114 

*138 

*161 

346 

370 

393 

577 

600 

623 

807 

830 

852 

*035 

*058 

*081 

262 

285 

307 

488 

511 

533 

713 

735 

758 

937 

959

981

159 

181 

203 

380 

403 

425 

601 

623 

645 

820 

842 

863 

*038 

*060 

*081 

255 

276

298


Prop. Pts.


1   2.9   2.8
2   5.8   5.6
3   8.7   8.4
4  11.6  11.2
5  14.5  14.0
6  17.4  16.8
7  20.3  19.6
8  23.2  22.4
9  26.1  25.2


1   2.7   2.6
2   5.4   5.2
3   8.1   7.8
4  10.8  10.4
5  13.5  13.0
6  16.2  15.6
7  18.9  18.2
8  21.6  20.8
9  24.3  23.4


1   2.4   2.3
2   4.8   4.6
3   7.2   6.9
4   9.6   9.2
5  12.0  11.5
6  14.4  13.8
7  16.8  16.1
8  19.2  18.4
9  21.6  20.7


1   2.2   2.1
2   4.4   4.2
3   6.6   6.3
4   8.8   8.4
5  11.0  10.5
6  13.2  12.6
7  15.4  14.7
8  17.6  16.8
9  19.8  18.9
Five-Place Logarithms of Numbers

200-250



711
Five-Place Logarithms of Numbers


250-300


Prop. Pts. 



811 

829 

846 

863 

881 

898 

985 

*002 

*019 

*037 

*054 

*071 

157 

175 

192 

209 

226 

243 

329 

346 

364 

381 

398 

415 

500 

518 

535 

552 

569 

586 

671 

688 

705 

722 

739 

756 

841 

858 

875 

892 

909 

926 

*010 

*027

*044 

*061 

*078 

*095 

179 

196 

212 

229 

246 

263 

347 

363 

380 

397 

414 

430 









614 

631 

647 

780

797 

814 

946 

963 

979 

*111 

*127 

*144 

275 

292 

308 

439 

455 

472 

602 

619 

635 

765 

781

797

927

943

959

*088

*104

*120

249

265

281

409 

425 

441 

569 

584 

600 

727 

743 

759

886

902 

917 

*044 

*059 

*075 

201 

217 

232 

358 

373 

389 

514 

529

545 

669 

685 

700 




04 

00 

840 

856 



727 741 756 770 784 799 813 828 842 


log e = .43429
712
Five-Place Logarithms of Numbers


300-350


Prop. Pts.


813 

828 

842 

958 

972

986 

101 

116

130 

244 

259 

273 

387 

401 

416 

530

544

558 

671 

686 

700 

813 

827 

841 

954 

968 

982 

*094 

*108 

*122 




234 

248 

262 


529 

542 

556

569 

583 

596 

610 

623 

637 

664 

678 

691 

705 

718 

732 

745 

759 

772 

799 

813 

826 

840 

853 

866 

880 

893 

907 

934 

947 

961 

974 

987 

*001 

*014 

*028 

*041 

068 

081 

095 

108 

121 

135 

148 

162 

175 

202 

215 

228 

242 

255

268 

282 

295 

308 

335 

348 

362 

375 

388 

402 

415 

428 

441 

468 

481 

495 

508 

521 

534

548 

561 

574 

601 

614 

627 

640 

654 

667 

680 

693 

706 

733 

746 

759 

772 

786 

799 

812

825 

838 


930 

943 

957 

970 

*061

*075 

*088 

*101 

192 

205 

218 

231 

323 

336 

349 

362 

453 

466 

479 

492 

582 

595 

608 

621 

711 

724 

737 

750 

840 

853 

866 

879 

969 

982 

994 

*007 

097 

110 

122 

135 


494 506 518



15
1   1.5
2   3.0
3   4.5
4   6.0
5   7.5
6   9.0
7  10.5
8  12.0
9  13.5

14
1   1.4
2   2.8
3   4.2
4   5.6
5   7.0
6   8.4
7   9.8
8  11.2
9  12.6

13
1   1.3
2   2.6
3   3.9
4   5.2
5   6.5
6   7.8
7   9.1
8  10.4
9  11.7

12
1   1.2
2   2.4
3   3.6
4   4.8
5   6.0
6   7.2
7   8.4
8   9.6
9  10.8

Prop. Pts.


log π = .49715
713
Five-Place Logarithms of Numbers

350-400




13
1   1.3
2   2.6
3   3.9
4   5.2
5   6.5
6   7.8
7   9.1
8  10.4
9  11.7

12
1   1.2
2   2.4
3   3.6
4   4.8
5   6.0
6   7.2
7   8.4
8   9.6
9  10.8

11
1   1.1
2   2.2
3   3.3
4   4.4
5   5.5
6   6.6
7   7.7
8   8.8
9   9.9

10
1   1.0
2   2.0
3   3.0
4   4.0
5   5.0
6   6.0
7   7.0
8   8.0
9   9.0

Prop. Pts.


714
Five-Place Logarithms of Numbers


400-450



715


Appendix Table X


Five-Place Logarithms of Numbers

450-500



485 495 504

581 591 600

677 686 696

772 782 792

868 877 887

963 973 982

*058 *068 *077
143 153 162 172

238 247 257 266

342 351 361
436 445 455

530 539 549

624 633 642

717 727 736

811 820 829 

904 913 922 

997 *006 *015 
089 099 108 

182 191 201 


274 284 293 


367 376 385 

459 468 477 

550 560 569 

642 651 660 

733 742 752 

825 834 843 

916 925 934 

*006 *015 *024 
097 106 115 


726 735 744 

815 824 833 

904 913 922 

993 *002 *011 




10
1   1.0
2   2.0
3   3.0
4   4.0
5   5.0
6   6.0
7   7.0
8   8.0
9   9.0

9
1   0.9
2   1.8
3   2.7
4   3.6
5   4.5
6   5.4
7   6.3
8   7.2
9   8.1

8
1   0.8
2   1.6
3   2.4
4   3.2
5   4.0
6   4.8
7   5.6
8   6.4
9   7.2


716
Five-Place Logarithms of Numbers

500-550
Five-Place Logarithms of Numbers

550-600
Five-Place Logarithms of Numbers

600-650



Prop. Pts.


8
1   0.8
2   1.6
3   2.4
4   3.2
5   4.0
6   4.8
7   5.6
8   6.4
9   7.2

7
1   0.7
2   1.4
3   2.1
4   2.8
5   3.5
6   4.2
7   4.9
8   5.6
9   6.3

6
1   0.6
2   1.2
3   1.8
4   2.4
5   3.0
6   3.6
7   4.2
8   4.8
9   5.4


719
Five-Place Logarithms of Numbers


650-700


338 

345 

351 

405 

411 

418 

471 

478 

485 

538 

544 

551 

604 

611 

617 

671 

677 

684 

737 

743 

750 

803 

809 

816 

869 

875 

882 

935 

941 

948 



885 

891 

897 

904 

910

916

923 

948 

954 

960 

967 

973 

979 

985 

011

017 

023 

029 

036 

042 

048 

073 

080

086 

092 

098 

105 

111 

136 

142 

148 

155

161 

167 

173 

198 

205 

211 

217 

223 

230 

236 

261 

267 

273 

280 

286 

292 

298 

323 

330 

336 

342

348 

354 

361 

386 

392 

398 

404 

410 

417 

423 

448 

454 

460 

466 

473 

479 

485 

510

516 

522 

528 

535 

541 

547


929 935 942 
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Five-Place Logarithms of Numbers 


700-750 



516 

522 

528 

535 

541 

547 

553

559 

566 

578 

584

590 

597 

603 

609 

615 

621 

628 

640 

646 

652 

658 

665 

671 

677 

683 

689 

702 

708

714 

720 

726 

733 

739 

745 

751 

763 

770 

776 

782 

788 

794 

800 

807 

813 

825 

831 

837 

844 

850 

856 

862 

868 

874 

887 

893 

899 

905 

911 

917 

924 

930 

936 

948 

954 

960 

967 

973 

979 

985 

991 

997 

009 

016 

022 

028 

034 

040 

046 

052 

058 

071 

077 

083 

089 

095 

101 

107 

114 

120 

132 

138 

144 

150 

156

163 

169 

175 

181 

193 

199 

205 

211 

217

224 

230 

236 

242 

254 

260 

266 

272 

278

285 

291 

297 

303 

315 

321 

327 

333 

339

345 

352 

358 

364 

376 

382 

388 

394 

400

406 

412 

418 

425 

437 

443 

449 

455 

461 

467 

473 

479 

485 

497 

503 

509 

516 

522 

528 

534 

540 

546 

558 

564 

570 

576 

582

588 

594 

600 

606 

618 

625 

631 

637 

643 

649 

655 

661 

667 

679 

685 

691 

697 

703 

709 

715 

721 

727 



398 

404 

410 

415 

421 

427 

433 

439 

445 

457 

463 

469 

475 

481 

487 

493 

499 

504

516 

522 

528 

534 

540 

546 

552 

558 

564 

576 

581 

587 

593 

599 

605 

611 

617 

623 

635 

641 

646 

652 

658 

664 

670 

676 

682 

694 

700 

705 

711 

717 

723 

729 

735 

741 

753 

759 

764 

770 

776 

782

788 

794 

800 

812 

817 

823 

829 

835 

841

847 

853 

859 

870 

876 

882 

888 

894 

900 

906 

911 

917 

929 

935 

941 


976


994 999 *005 *011 *017 *023 *029 *035 

052 058 064 070 075 081 087 093 

111 116 122 128 134 140 146 151 

169 175 181 186 192 198 204 210

227 233 239 245 251 256 262 268

286 291 297 303 309 315 320 326 

344 349 355 361 367 373 379 384 

402 408 413 419 425 431 437 442 

460 466 471 477 483 489 495 500 
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795 
852 
910 

758 967

759 88 024



512 

518 

523 

529 

535 

541 

570

576 

581 

587 

593 

599 

628 

633 

639 

645 

651 

656 

685 

691 

697 

703 

708

714 

743 

749 

754 

760 

766 

772 

800 

806 

812 

818 

823 

829 

858 

864 

869 

875

881

887 

915 

921 

927 

933

938 

944 

973 

978 

984 

990

996 

*001 

030 

036 

041 

047 

053 

058 

087 

093 

098 

104 

110

116 

144 

150 

156 

161 

167 

173 

201 

207 

213 

218 

224 

230 

258 

264 

270 

275 

281 

287 



765 366 

766 423 

767 480 

768 536 

769 593



209 215 221 

265 271 276 

321 326 332 

376 382 387 

432 437 443 

487 492 498 

542 548 553

597 603 609

653 658 664

708 713 719


226 232 

282 287
337 343 
393 398 

448 454 

504 509 

559 564 

614 620 

669 675 

724 730 


237 243 

293 298 

348 354 

404 409 

459 465 

515 520 

570 575 

625 631 
680 686 
735 741 


823 

829 

834 

840 

845 

851 

878 

883 

889 

894 

900 

905 

933 

938 

944 

949 

955 

960 

988 

993 

998 

*004 

*009 

*015 

042 

048 

053 

059 

064 

069 

097 

102 

108

113

119

124

151 

157 

162 

168 

173 

179 

206

211 

217 

222 

227 

233 

260 

266 

271 

276 

282 

287 

314 

320 

325

331

336

342 


248 

254 

260 

304 

310 

315 

360 

365 

371 

415 

421 

426 

470 

476 

481 

526 

531 

537 

581 

586 

592 

636 

642 

647 

691 

697 

702 

746 

752 

757 

801 

807 

812 



722
Five-Place Logarithms of Numbers

800-850


N   L 0   1   2   3   4   5   6   7   8   9

Prop. Pts. 

801 

802 

803 

804 

805 

806 

807 

808 
809 

363 

417 

472 

526

580 

634 

687 

741 

795 

369 

423 

477 

531 

585 

639 

693 

747 

800 

320 

374 

428 

482 

536 

590 

644 

698 

752 

806 

325 

380 

434 

488 

542

596

650 

703 

757 

811 

331

385 

439 

493 

547 

601 

655 

709 

763 

816 

336 

390 

445 

499 

553 

607 

660 

714 

768 

822 

342 

347 

352 

358 


396 

450 

504 

558 

612 

666 

720 

773 

827 

401 

455 

509 

563

617 

671 

725 

779 

832 

407 

461 

515 

569 

623 

677 

730 

784 

838 

412 

466 

520 

574 

628 

682 

736 

789 

843 


810

849

854

859

865

870

875

881 

886 

891 

897 


811 

812 

813 

814 

815 

816 

817 

818 
819 

902 

956 

91 009

062 

116 

169 

222 

275 

328 

907 

961 

014 

068 

121 

174 

228 

281 

334 

1 

918 

972 

025 

078 

132 

185 

238 

291 

344 

924 

977 

030 

084 

137 

190 

243 

297 

350 

929 

982 

036 

089 

142 

196 

249 

302 

355 

934 

988 

041 

094 

148 

201 

254 

307 

360 

940 

993 

046 

100 

153 

206 

259 

312 

365 

945 

998 

052 

105 

158 

212 

265 

318 

371 

950 

*004 

057 

110 

164 

217 

270 

323 

376 

6
1   0.6
2   1.2
3   1.8
4   2.4
5   3.0
6   3.6
7   4.2
8   4.8
9   5.4

820




397 

403 

408 

413 

418 

424 

429 

5
1   0.5
2   1.0
3   1.5
4   2.0
5   2.5
6   3.0
7   3.5
8   4.0
9   4.5

821 

822 

823 

824 

825 

826 

827 

828 
829 

434 

487 

540 

593 

645 

698 

751 

803 

855 

440 

492 

545 

598 

651 

703 

756 

808 

861 

445 

498 

551 

603 

656 

709 

761 

814 

866 

450 

503 

556 

609 

661 

714 

766 

819 

871 

455 

508 

561 

614 

666 

719 

772 

824 

876 

461 

514 

566 

619 

672 

724 

777 

829 

882 

466 

519 

572 

624 

677 

730 

782 

834 

887 

471 

524 

577 

630 

682 

735 

787 

840 

892 

477 

529 

582 

635 

687 

740 

793 

845 

897 

482 

535 

587

640 

693 

745 

798 

850 

903 

830

908 

913 

918 

924 

929 

934 

939 

944 

950 

955 

831 

832 

833 

834 

835

836 

837 
838
839 

960 
92 012
065
117

169 

221 

273 

324 

376 

1 

971 

023 

075 

127 

179 

231 

283 

335 

387 

976 

028 

080 

132 

184 

236 

288 

340 

392 

981 

033 

085 

137 

189 

241 

293 

345 

397 

986 

038 

091 

143 

195 

247 

298 

350 

402 

I 

997 

049 

101 

153 

205 

257 

309 

361 

412 

*002 

054 

106 

158 

210 

262 

314 

366 

418 

*007

059 

111 

163 

215 

267 

319 

371 

423 


840

428

433 

438 

443 

449 

454 

459 

464 

469 

474 

841 

842 

843 
844

845 

846 
847

848
849

480
531
583
634

686

737 

788 

840 

891 

485 

536 

588 

639 

691 

742 

793 

845

896 

490 

542 

593 

645 

696 

747 

799 

850

901 

495 

547 

598 

650 

701 

752 

804 

855 

906 

I 

505 

557 

609 

660 

711 

763 

814 

865 

916 

511 

562 

614 

665 

716 

768 

819 

870 

921 

516 

567 

619 

670 

722 

773 

824 

875 

927 

521 

572 

624 

675 

727 

778 

829 

881 

932 

526 

578 

629 

681 

732 

783 

834 

886 

937 

850

942

947 

952 

957 

962 

967 

973 

978 

983 

988 

N   L 0   1   2   3   4   5   6   7   8   9


Prop. Pts.


723 
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724 . 










N L 0


Appendix Table X

Five-Place Logarithms of Numbers

900-950


434 439 444 448 453 458 463 468


506 511 516 

554 559 564 

602 607 612 

650 655 660 

698 703 708 

746 751 756 

794 799 804 

842 847 852 


938 

942 

947 

985 

990

995

*033

*038

*042

080

085

090

128

133

137 

175 

180 

185 

223 

227 

232 

270 

275 

280 

317 

322 

327 

365 

369 

374 


900 

904 

909 

914 

918 

923 

928 

932 

937 

946 

951 

956 

960 

965 

970 

974 

979 

984 

993 

997 

*002 

*007 

*011 

*016 

*021 

*025 

*030 

039 

044 

049 

053 

058 

063 

067 

072 

077 

086 

090 

095 

100 

104 

109 

114 

118 

123 

132 

137 

142 

146 

151 

155 

160 

165 

169 

179 

183 

188 

192 

197 

202 

206 

211 

216 

225 

230 

234 

239 

243 

248 

253

257 

262 

271 

276 

280 

285 

290 

294 

299

304 

308 


345 350 354 


Prop. Pts. 


804 809 813 
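
The entries of Appendix Table X can be regenerated mechanically, which gives a convenient check wherever a printed digit is doubtful. The following is a minimal sketch in Python and is not part of the original table; the function names `mantissa5`, `table_row`, and `prop_parts` are hypothetical helpers introduced only for illustration. It assumes the usual layout of a five-place table: each row N lists the mantissas of the four-figure numbers N0 to N9, and each proportional-parts column lists the tenths of a tabular difference.

```python
import math


def mantissa5(n: int) -> str:
    """Five-place mantissa of log10(n), as a zero-padded string (118 -> '07188')."""
    frac = math.log10(n) % 1.0          # drop the characteristic, keep the mantissa
    return f"{round(frac * 100_000) % 100_000:05d}"


def table_row(n: int) -> list[str]:
    """Mantissas for columns 0-9 of row n, i.e. for the numbers 10n .. 10n + 9."""
    return [mantissa5(10 * n + d) for d in range(10)]


def prop_parts(diff: int) -> list[float]:
    """Proportional parts: one tenth to nine tenths of a tabular difference."""
    return [round(diff * k / 10, 1) for k in range(1, 10)]


if __name__ == "__main__":
    print(118, " ".join(table_row(118)))       # row 118 of the table
    print(37, prop_parts(37))                  # the 'Prop. Pts.' column headed 37
    print(round(math.log10(math.e), 5))        # footnote constant log e
    print(round(math.log10(math.pi), 5))       # footnote constant log pi
```

The printed table shows the leading two digits of the mantissa only where they change and marks with an asterisk the entries whose leading digits belong to the next row; the sketch returns all five digits for every entry, so those conventions have to be applied by hand when comparing.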
Abscissa, 9, 64
Accuracy of estimation, 332 
Actuarial science, 272, 671 
Aggregate, 307; use in index numbers, 
165, 181; weighted, 193, 316 
Alfalfa yield, correlation with irrigation,
404 ff.

American index numbers of wholesale 
prices, Bradstreet’s index, 166, 182, 
209, 211; Harvard index, 210; U. S. 
Bureau of Labor Statistics, 168, 172, 
176, 193, 216 ff. 

American Telephone and Telegraph 
Co., index of industrial activity, 
312, 390, 393; study of frequency of
telephone use, 440 
Amplitude of cycles, 238, 285 
Analysis of variance, see Variance 
analysis 

Anderson, Oskar, 454, 487 
Angell, James W., 270 
Anti-logarithm, 24 

Arbitrary origin, 351; in computing 
the mean, 106 

Areas of the normal curve, 436 ff., 699 
Arithmetic mean, 102; computation
of, 103 ff.; weighted, 104; characteristics
of, 126, 134; as most probable
value, 332; moments about,
442; of the binomial distribution,
433, 660; significance of the difference
between means, 481 ff.; significance
of, in small samples,
605 ff.; use in correlation analysis,
537; standard error of, 464 ff., 664
Arithmetic series, 16, 28, 275 
Array, 52, 82; in computing the cor- 
relation coefficient, 345 
Artillery observations, 90 
Astronomical observations, 89 
Asymmetry, see Skewness 
Average, 86 ff., 99, 101 ff.; relations 
between the several averages, 133; 
moving average, 234 ff.; of ratios
to trend, 287 


Average of relative prices, 183 ff.,
196 ff.

Average relationship between vari- 
ables, 328 

Bank clearings, as index of business 
conditions, 242 

Bar diagram, see Column diagram 
Barlow’s tables, 133 
Base period, 316 
Bean, Louis H., 564 
Beckett, S. H., 405 
Beta coefficients, 561 
Bias, of index numbers, 191, 195; of 
the correlation index, 412; in sam- 
pling, 461, 599, 611 
Binomial distribution, 429 ff., deriva- 
tion of the mean and standard 
deviation of, 660 
Birge, Raymond T., 469, 629 
Black, John D., 564 
Blakeman, John, 478 
Bowley, A. L., 159, 201; representa- 
tive sampling, 461; standard error, 
472 

Bradstreet’s index, 166, 182, 209; use 
in deflation, 383
Burns, Arthur F., 242, 308 
Business, 1; classes of activity, 1; 
quantitative character of its prob- 
lems, 3, 6 

Business cycle, 293 ff.; as indicated by 
moving averages, 242; as measured 
by production changes, 305; pre-war
relation to stock-price cycles,
390 ff.; post-war relation, 396; duration
of, 467, 481; effect on price,
475; see also Cyclical variation

Census of manufactures, 309, 317, 371 

Center of gravity, 102 

Central tendency, 97, 99; measures
of, 101 ff.
Certainty, in probability theory, 426
Chaddock, R. E., 84
Chain index, 216

Chance, law of, see Normal law of
error and Probability
Characteristic, logarithmic, 24
Charlier check, 149 
Charts, construction of, 32 ff.; for 
comparison of frequencies, 41 ff.; 
representation of component parts,
42 ff.; cumulative, 44 ff.; Gantt
progress chart, 46; see also Graphic 
presentation 

Chi-square, 618; distribution of, 618,
626; in testing goodness of fit,
626 ff.; in testing homogeneity,
633 ff.; in testing independence of
principles of classification, 630 ff.;
table of values, 625, 703
Classification of quantitative material,
see Organization of data
Classification, principles of, 53 ff.; testing
significance of, 494 ff.; testing
independence of, 630 ff., 681 ff.
Classified data, see Grouping of data 
Class interval, 53, 57 ff., 104, 347, 
357; in locating the mode, 117; in 
computing the standard deviation, 
149 

Coefficient of correlation, see Cor- 
relation coefficient 

Coefficient of multiple correlation, see
Correlation, multiple
Coefficient of regression, see Regression

Coefficient of variation, 156
Coin tossing, 91 

Column diagram, 41, 64 ff., 66, 73, 91
Commodities, included in price-change
study, 209

Compound interest, law of, 30 ff., 40;
curve of, 267 

Concurrence of cycles, 390 ff. 
Constants, 12, 244 ff. 

Controls, in sampling procedure, 462 
Coordinate geometry, 8 ff.
Correction, of index numbers, 311;
of the correlation index, 412; of the
standard error, 542

Correction factor, in computing the 
correlation coefficient, 339; in com- 
puting the mean, 106, 351, 392; see 
also Bias 

Correlation, coefficient of, 334 ff., 520;


calculation of, 337 ff., 353, 364, 648; 
product-moment method, 349 ff.; 
construction of table, 340 ff.; summary
of correlation procedure,
366 ff.; limitations of, 370; relation
to correlation ratio, 422; tests for
the significance of, 502 ff., 611; significance
of difference of coefficients,
616; standard error of, 474;
derived from small samples, 610; 
weighted average of, 617, 618; table 
of significant values, 612, 701; 
table of relations to the z function, 
702 

Correlation, index of, 408 ff., 520; 
formula for, 408, 412, 647; sig- 
nificance of, 409; computation of, 
410, 412; standard error, 477 

Correlation, linear, 325 ff.; lines of 
regression, 359 ff.; distortion in
non-normal distributions, 372 ff.;
of grouped data, 340 ff.; in the
measurement of time sequence, 
389 ff.; see also Correlation, coeffi- 
cient of 

Correlation, multiple, 530 ff.; preliminary
analysis, 533; use of multivariate
estimating equation, 536 ff.;
coefficient of, 543 ff.; correction 
for number of constants involved, 
544; test of significance of the 
coefficient of, 544; standard error of 
the coefficient of, 545; application 
of method, 547; limitations of proce- 
dure, 563; simplification of normal 
equations, 652 ff. 

Correlation, non-linear, 404 ff.; use of
reciprocal relations, 582; use of 
logarithmic relations, 575 ff. 

Correlation of time series, 380 ff.; of
secular trends, 381; of deviations 
from trend, 385; dangers of pro- 
cedure, 388, 389; concurrent cycles, 
391; use of moving average in, 
398; of short term fluctuations, 
398 ff. 

Correlation, partial, 584 ff.; relation
to simple correlation, 549; systematic
computation of coefficients
of, 554 ff.; standard error of the
coefficient of, 560

Correlation, rank, 374 ff., 521
Correlation ratio, 413, 519; formula
for, 414; computation of, 415,
417 ff.; significance of, 421; correction
of, 421; relation to correlation
coefficient, 422
Cost of living index, 221
Cotton statistics, 161; correlation of 
price and production, 384, 400 
Coverage, of production index num- 
bers, 316 
Cox, G. M., 688 
Criteria of curve type, 444 
Crum, W. L., 302

Cumulative charts, 44 ff.; arrange- 
ment of data, 77; frequency curve, 
80 

Curve fitting, by least squares, 246 ff.;
linear, 246; parabolic, 253, 260 ff.;
of linear business series, 257; use of
logs in, 264, 269 

Curve type, criteria of, 444; see also
Functional relationship 
Cutts, Jesse M., 218 
Cycles, correlation of, 382, 389 
Cycles of reference, see Reference 
cycles 

Cyclical fluctuations, correlation of, 
382 ff.; see also Cyclical variation 
Cyclical variation, 230, 302, 380, 526;
removal by moving averages, 236,
284; measurement of, 293 ff.; in
industrial activity, 312 ff., 390 

Davenport, Donald H., 219, 254, 437,
573

Davenport, E., 415

Day, Edmund E., construction of index
of physical volume, 310

Decile, graphic location of, 114 
Deflation, in time series analysis, 279 
Degrees of freedom, in variance
analysis, 491, 496, 504 ff., 512, 517,
528; in statistical induction, 604,
611

Degree of relationship, see Relationship,
measurement of

Deming, W. E., 469, 470; Chi-square
test, 629

Dennis, Samuel J., 218

Dependent variable, see Variable
Depreciation, 79

Derivative, partial, 639, 643


Descartes, René, 8

Description, of frequency distributions,
86, 137 ff., 448 ff.; methods
of, 99 ff.; statistical, 452 ff. 

Deviate, normal, 437 
Deviation, 97; probable, 152; from 
trend, 263, 385, 395; from mean, 
347; vertical and horizontal, 363; 
from moving averages, 398; from 
means of arrays, 417; quartile, see 
Quartile deviation; mean, see Mean 
deviation; standard, see Standard 
deviation; root-mean-square, see 
Root-mean-square deviation 
Differences, finite, 275 
Discount rates, relation between,
340 ff., 361 ff.

Dispersion, 99, 115, 137, 414; zone of,
89 ff., 349; measures of, 137 ff.,
330, 335; in correlation analysis,
490; test of differences in dispersion,
492; see also Variation and
Scatter 

Distribution, frequency, 50 ff.; de- 
scription of, 137 ff.; general char- 
acteristics of, 97 ff.; of income, 71; 
of sawmills, 84; of heights, 87; of 
astronomical errors, 89; of artillery 
shots, 90; of coin throws, 91; of 
economic data, 93 ff.; of exchange
rates, 96; of wage earners, 96; of 
bonds, 116; of stock prices, 125; 
see also List of charts 
Doolittle solution, of normal equa- 
tions, 540, 655 

Dow-Jones index number, 393

Edgeworth, F. Y., 204; binomial ex- 
pansion, 432 

Elderton, W. P., Chi-square table, 624
Equation of regression, see Regression
Error, normal curve of, see Normal
law of error

Error, sampling, see Standard error
Estimate, making of, 332 ff., 566 ff.;

zone of, 349, 571 ff., 590 ff.
Exchange rates, distribution of, 94 
Expected value, 294 
Exponential curve, 19, 28, 258, 266,
271; modified, 272, 667; see also
Logarithmic curve
Exponent, logarithmic, 23
Exports, statistics of, 36, 38
Extrapolation, 264, 277, 675
Ezekiel, Mordecai, 412, 477, 547, 564,
652; multiple correlation analysis,
537 ff.; correction of standard error,
542; correction of the correlation
coefficient, 544

Factor reversal test, 199 
Falkner, Helen B., 288
Farm price index number, 222 ff. 
Federal Reserve Board, index of pro- 
duction, 316 ff. 

Fisher, Arne, 448

Fisher, Irving, 181; time reversal test, 
190; weighted index numbers, 195, 
196, 204; factor reversal test, 199; 
ideal index number, 201 
Fisher, R. A., 270, 479, 603; statistical
population, 456; null hypothesis,
475; analysis of variance, 490 ff.;
z table, 499, 704-5; extension of z
table, 518; t table, 603, 700; significance
of the correlation coefficient,
611; Chi-square table, 625,
628, 703

Frazier, Edward K., 105 
Frequency curve, 41, 82; polygon,
41, 67, 85, 88, 91, 93
Frequency distribution, 50 ff.; purpose
of, 56; comparison of, 86; general
characteristics of, 97 ff.; see
also Distribution 

Frequency, theoretical and actual, 
431 ff. 

Friedman, Milton, 521
Functional relationship, 12, 389; lin- 
ear, see Linear relationship; para- 
bolic, see Parabolic relationship 

Galton, Francis, lines of regression, 
359 

Gantt, H. L., progress chart, 46
Gauss, Karl Friedrich, normal law of 
error, 435 

Geometric mean, 125 ff.; definition of,
125; computation of, 126; characteristics
of, 127, 135; as measure of
central tendency, 129; as average 
of relative prices, 185; of logarith- 
mic observations, 584 
Geometric progression, 18, 28, 271, 
275, 669 


Glover, James W., 269, 437 
Gompertz curve, 272, 671 
Goodness of fit, 447; criteria of, 276;

Chi-square test of, 626 ff. 

Graphic method, of locating aver- 
ages, 120 ff.; in multiple correla- 
tion, 564 

Graphic presentation, 8 ff.; of frequency
distributions, 63; of time
series, 227 

Grouping of data, 53, 112; ungrouped
data, 109; effect on mode, 115; in 
correlation tables, 340, 354 
Growth curves, Gompertz, 272, 671; 
modified exponential, 272, 667 ff.;
logistic, 272, 675 ff. 

Hall, Lincoln W., 288 
Harmonic equation, 579 
Harmonic mean, 132 ff.; character- 
istics of, 135; of relative prices, 
186; of reciprocal observations, 585, 
587 

Hart, Hornell, reliability of a per- 
centage, 483 

Height distribution, 87, 360 
High contact, of frequency distribu- 
tions, 443 

Histogram, 64 ff.; see also Column 
diagram 

Homogeneity, 487; tests for, 120,
630 ff.; in time series, 301; in sampling
procedure, 462, 607; Chi-square
test of, 633 ff.

Hotelling, Harold, 378, 479 
Hyperbolic curve, 16, 28, 569 

Ideal index, 201; for the measurement
of production, 307 

Income distribution, statistics of, 71, 
97, 102 

Independence, tests of, 630 ff.
Independent variable, see Variable
Index numbers, 18; nature of, 161 ff.;
“ideal,” 201; use of aggregates,
165; of retail price, 220; of cost of
living, 221; of farm price, 222 ff.; of
seasonal variation, 287 ff.; of industrial
activity, 312 ff., 390, 393;
of stock prices, 390 ff.

Index numbers of production, 305 ff.;
unadjusted, 306; adjusted, 310;
Federal Reserve Board index, 316;
derived from price indices, 319 ff.; 
of industrial productivity, 321 
Index numbers of wholesale prices, 
167, 216 ff.; purpose of, 170; con- 
struction of, 180 ff., 208 ff.; aggre- 
gative type, 181; arithmetic aver- 
age type, 183; weighted, 196; of 
farm crop prices, 182, 189; geomet- 
ric average type, 185, 198; median 
type, 185; harmonic type, 186; 
comparison of types, 188; time 
reversal test, 190; weighted types, 
193 ff., 198; alternative types, 
204 ff.; commodities to be included 
in, 209 

Index of correlation, see Correlation 
index 

Index of variability, 157 
Induction, statistical, 452 ff., 598 ff.; 
nature of, 453; measures of reliability,
464 ff.; generalizing from small
samples, 598 ff.

Industrial change, measurements of, 
322 

Inference, statistical, see Induction 
Interaction, of principles of classifica- 
tion, 688 

Interpolation, 70, 81, 277; for the 
median, 114; for the mode, 118; for 
monthly trend values, 273; in 
Fisher’s z table, 500; double inter- 
polation, 507 

Irrigation, correlated with alfalfa 
yield, 404 ff. 

Jones, D. C., binomial distribution, 
660

Karsten, Karl G., 278 
Kelley, Truman L., 206; reliability of 
constants, 485 
Kendall, M. G., 629

Keynes, J. M., random sampling, 461
Killough, H. B., 569
Knibbs, Sir George, 214
Kurtosis, 100, 137, 159

Kurtz, Edwin, 77 

Lag, in time series analysis, 390 ff.;
changes in different cycle phases,
397


Laspeyre’s index number, 193, 214 
Law of large numbers, 455 
Least squares, method of, 246 ff., 
638 ff.; applied to linear relations,
246, 328, 354, 366, 509; applied to 
power curves, 260, 405; applied to 
logarithmic curves, 264 ff.; in correlation
analysis, 366, 373, 405
Leptokurtic, 449
Life table, 77 

Line of regression, see Regression 
Linear correlation, see Correlation, 
linear 

Linearity, test for, 423; by variance 
analysis, 508 ff.; see also Linear 
relationship 

Linear relationship, 14, 16, 26, 325 ff.;
fitting by least-squares, 246 ff.; in
business series, 257, 268; between
discount rates, 348; tests for, 423,
477, 508 

Link relatives, 204 

Logarithmic, equation, 26 ff., 563, 
569 ff., 671; mean, 128; see also 
Geometric mean; paper, 131, 227; 
deviation, 265; function of the cor- 
relation coefficient, 614 
Logarithms, common, 23 ff., 492, 572; 
use in computing the geometric 
mean, 125, 130; use in curve fitting, 
264 ff., 269; Naperian, 435, 492; 
Appendix table X, 709 
Logistic curve, 272, 675 

Macaulay, F. R., 185, 244 
Malenbaum, Wilfred, 565 
Mantissa, 24 

Manufactured goods, role in price 
movements, 213 

Mean, arithmetic, see Arithmetic 
mean; geometric, see Geometric
mean; harmonic, see Harmonic 
mean 

Mean deviation, 139 ff. 

Mean product, 351, 358 
Measurement of, central tendency,
see Central tendency; relationship,
see Relationship, etc.

Median, definition of, 102; location of,
109 ff.; computation of, 113;
graphic location of, 120 ff.;
characteristics of, 134; relation to mean
deviation, 140; of relative prices,
185; standard error of, 472
Merriman, Mansfield, 90 
Mesokurtic, 449 
Minor, J. B., 556

Mitchell, W. C., 93, 173, 176, 212,
242, 303; comparison of index numbers,
209; business cycles, 467, 483
Mode, 96; definition of, 101; location
of, 115 ff.; graphic location of,
120 ff.; characteristics of, 135
Moments, of frequency distributions, 
440; about the mean, 442 
Monthly trend values, 272 ff.
Mortality tables, 80 
Moving average, 234 ff.; application
to non-linear series, 239; measurement
of seasonal fluctuations,
285 ff.; use in correlating cycles,
398 ff.

Mudgett, Bruce D., 216 
Multiple correlation, see Correlation, 
multiple 

Multiple frequency table, 289 

Napierian logarithm, 435 
National Bureau of Economic Research,
244, 320, 397; study of income
distribution, 132; construction
of index numbers, 219; study
of production change, 309
National Industrial Conference Board,
cost of living index, 221
Natural number, 24, 28; table of
squares of, 706; sums of powers, 708 
New York Census of Manufactures,
309, 317, 371

Non-linear correlation, see Correlation,
non-linear

Non-linear relationship, 404 ff.; see
also Parabolic and exponential function

Normal deviate, 437, 599; table of, 
603, 699 

Normal equations, for linear relation- 
ships, 249; parabolic, 254; of multi- 
variate relationships, 537 ff.; deriva- 
tion of, 639; formation of, 640; 
checks on, 648; Doolittle solution 
of, 654 

Normal, law of error, 98, 153, 425 ff.,
435 ff.; assumptions underlying,
436; its use, 438; economic application
of, 440 ff.; criteria for, 444;
fitting the normal curve, 445 ff.;
distribution, 332, 371, 458; departure
from, 374, 378; computation
of theoretical frequencies, 446;
generalization of results, 448; of
the distribution of means, 464; use
in measures of reliability, 464 ff.;
area under, 437, 699; test of goodness
of fit of, 627
Null hypothesis, 475 

Ogive, 80 ff., 85 

Organization of data, 51, 82, 100; in 
time series, 226 

Origin, arbitrary, 107, 351; at point 
of averages, 353, 365 
Orthogonal polynomials, 270 

Paasche’s index number, formula for, 
195, 215 

Pabst, Margaret, 378, 479 
Parabolic curve, 16, 21, 27, 270, 577;
see also Parabolic function
Parabolic function, fitting of, 253 ff.;
second degree, 260, 405; logarithmic,
264, 269, 270; testing parabolic
hypothesis, 514 ff.

Parameter, 457 

Pareto, Vilfredo, law of income dis- 
tribution, 132 

Partial correlation, see Correlation, 
partial 

Peake, E. G., 94 

Peakedness, 100; see also Kurtosis 
Pearl, Raymond, 271, 272; formation 
of normal equations, 642; logistic 
curve, 675 

Pearson, Karl, 156, 158, 254, 436;
coefficient of correlation, 335;
correlation ratio, 413 ff.; curve types,
448; descriptive measures of frequency
distributions, 448 ff.; statistical
inference, 454; Chi-square distribution,
618 ff., 626
Percentages, difference between and
significance of, 483
Percentile, 114 

Periodic fluctuation, 230; removal by
moving averages, 235; see also
Seasonal and cyclical variation
Periodic function, 21
Persons, Warren M., 204; analysis of
cycle lags, 390 ff.

Platykurtic, 449

Polynomial, orthogonal, 270; see also
Parabolic function
Population, statistical, 453, 454, 456
Potential series, 21 

Power series, 253; see also Parabolic 
function 

Price relative, 162; arithmetic aver- 
age of, 183 

Price, wholesale, 93, 168; index num- 
bers of, 161 ff., 167, 216 ff.; see also 
Index number; price ratios, 171 ff.; 
measurement of change of, 174; 
wholesale groups, 211; index of re- 
tail, 220; of farm products, 222; 
deflation of, 279; measurement of 
variation in, 493 

Probable error, 152, 155; of index 
numbers, 206; see also Standard 
error 

Probability, 603; principles of, 425;
addition of, 427; measurement of,
429; a priori, 431; empirical, 431;
normal, 439, 459, 471; normal
table of, 699; integral, 436
Probability curve, 98; see also Normal
law of error
Probable value, 332 
Production, statistics of, 10, 35, 40, 
43, 47; of fuel, 163, 265; of crops, 
192; as measured by index numbers, 
305 ff.; see also List of charts
Product-moment method, 349 ff.,
368; for classified data, 354 ff.
Projection, of trend values, 277, 402
Purposive selection, in sampling
procedure, 462

Quartile, 114; graphic location of,
120 ff.; deviation, 150 ff., 154;
standard error of, 473

Random fluctuations, 231; removal 
by moving averages, 241 
Random sampling, 458, 461; see also
Sampling

Range, of variation, 139, 154;
semi-interquartile, 151


Rank correlation, 374 ff.; see also
Correlation, rank

Rate, of interest, 30, 76, 228; of 
change, 40, 267, 278, 587; of ex- 
change, 94; averaging of, 125 
Ratio, chart, 29, 35 
Ratio, correlation, 413; see also Cor- 
relation ratio 

Reciprocals, use in measuring rela- 
tionship, 578 ff., 675 ff. 

Reed, Lowell J., 272; logistic curve,
675 

Reference cycles, 243, 262; correla- 
tion of, 382 
Regimen, 214, 322 

Regression, lines of, 359 ff.; use of,
364 ff., 367, 423, 607; coefficient of 
regression, 359 ff., 363, 479, 561, 
607; for cotton production and 
price, 387, 401; standard error of 
coefficient of regression, 479, 607, 
609 

Relationship, between income and 
auto registration, 326 ff., 352; meas- 
urement of, 325 ff., 334; between 
discount rates, 340 ff.; between 
time series, 380 ff.; temporal, 391 ff.;
linear, see Linear relationship 
Relative deviations, 129; weighted, 
167 

Relative price, 162; arithmetic aver- 
age of, 183; geometric average, 
185, 198; harmonic average, 187; 
weighted average, 196 
Relative variation, measurement of, 
156 ff., 264 

Reliability, measures of, 464; of the 
mean, 464; of the difference be- 
tween means, 481, 483; of the me- 
dian, 472; of the standard deviation, 
473; of the coefficient of correlation, 
474; index of correlation, 477;
coefficient of regression, 478
Residuals, 247 

Residual variability, see Variability,
residual

Retail price, index of, 220 
Richardson, A. H., 49
Rietz, H. L., 143 
Robertson, R. D., 405 
Robinson, G., 88, 274, 465 
Root-mean-square deviation, 146,
276, 330, 416; see also Standard
deviation
Rulon, P. J., 485

Sample, size of, 117; estimates from,
146; in constructing index numbers,
206 

Sampling, problems of, 452 ff., 460;
random, 458, 461; generalizing from
small samples, 598 ff.; errors of,
293, 447; see also Standard error
Sasuly, Max, 270
Scale, for curve reading, 39 
Scatter, 99; degree of, 137, 334, 409,
414, 646; see also Variation 
Scatter diagram, 326, 328, 348, 370, 
416 

Scott, Frances V., 219

Seasonal variation, 230, 284 ff., 380;
removal by moving averages, 235;
measurement of, 287 ff.; adjustment
of, 317; test of significance of, 522 ff.
Secular trend, 229, 380, 487; of cotton
production and price, 383, 385;
measurement of, 231 ff.; representation
by moving average, 234;
by mathematical curves, 244 ff.,
667 ff.; of business series, 257 ff.;
selection of curve, 274 ff.

Selection of curve of trend, 274 
Semi-interquartile range, 151
Semi-logarithmic charts, 28, 264;
advantages of, 40

Series, periodic, 21; potential, 21; 
continuous, 75 

Sheppard, W. F., correction for
grouping, 150, 442 ff.; table of 
normal areas, 436 

Shewhart, W. A., 49; distribution of 
the standard deviation, 600 ff. 
Significance, tests of, 464 ff.; see also
Standard error 
Significant figures, 485 
Sine curve, 21 

Skewness, 96; measures of, 100, 122,
137 ff., 157 ff., 449; of geometric 
series, 129; of the standard devia- 
tion, 600; of the correlation co- 
efficient, 610 

Slope, 293; of regression line, 336, 
350, 359, 361; see also Regression 
coefficient 


Smoothing of curves, 69 ff., 76, 117
Snedecor, George W., 449, 688
Snyder, Carl, 229 
Spurr, W. A., 293 

Squares of natural numbers, table of,
706

Standard deviation, 145 ff., 330,
371, 416; characteristic features of,
155; use in adjusting index numbers,
311, 393, 395; in terms of
moments, 443; about the means
of arrays, 418; use of, in variance
analysis, 491, 494; see also Standard
error

Standard error, of the binomial dis- 
tribution, 434, 660; of the mean, 
464, 664; of the difference of 
means, 481, 483; of the median,
472; of the standard deviation, 473;
of the correlation coefficient, 474,
545; of the correlation index, 477;
of the regression coefficient, 478; of
the partial correlation coefficient,
560, 615; of the z function, 493,
615; limitations of above measures,
486 ff. 

Standard error of estimate, 330 ff.;
computation of, 333, 338, 370, 388,
401, 406, 590; short-cut calculation,
346, 354; of parabolic functions,
410; significance of, 349, 371; about
line of regression, 480; correction
of, 413, 542; in multiple correlation
analysis, 534, 541 ff.; of logarithmic
functions, 571 ff.; in ratio
terms, 573; in reciprocal terms,
581; zones of estimate, 590 ff.

Starr, G. W., 312
Statistic, 457 

Statistical description, see Description

Statistical induction, see Induction
Steinmetz, C. P., 256 
Stewart, Ethelbert, 84
Stock price cycles, relation to busi- 
ness activity, 390, 397 
Straight line, fitting of, 246; see also 
Linear relationship 
Stratification, in sampling procedure, 
462 

Stratified purposive sampling, 463;
standard error of, 472
"Student," standard error of the
rank-correlation coefficient, 479; standard
error of the mean, 699; distribution
of the standard deviation,
600 ff.

Sturges, H. A., 57 
Symbols, glossary of, 691 ff. 

Symmetry, 100; degree of, 120; see
also Skewness

Table, of areas under the normal
curve, 699; Fisher t table, 603,
700; of significant values of the
correlation coefficient, 612, 701; of
relations of the correlation coefficient
to the z function, 702; of the
distribution of z, 704-5; of the
powers of natural numbers, 706,
708; of common logs, 709

Tabulation of data, 51, 62; in
correlation tables, 341, 354, 415
Tendency, central, see Averages and
Central tendency
Thompson, F. L., 292 
Time reversal test, 190
Time series, charts, 33, 48, 50;
analysis of, 225 ff., 295; graphic
representation, 227; removal of
cycles, 234; fitting a line to, 252;
measurement of seasonal fluctuation,
284 ff.; of cyclical fluctuation,
284; measurement of relations between,
380 ff.; see also Correlation
of time series
Tolley, 537, 652

Trend, 262; of price movements, 170;
of monthly values, 272; selection of
curve of, 274 ff.; measurement of,
225 ff.; secular, see Secular trend

Ungrouped data, 109; product-moment
method for, 352

Uniformity of nature, principle of,
457

Unweighted index number, 184

U. S. Bureau of Internal Revenue, 
326 

U. S. Bureau of Labor Statistics,
statistics of fuel production, 164;
index of wholesale prices, 168, 172,
176, 212, 216 ff., 282; index number
used, 193; index of retail prices,
220; cost of living index, 221
U. S. Department of Agriculture,
index of farm prices, 222

Variability, measures of, 490 ff., 560;
between classes, 494 ff.; absolute,
586; residual, 526, 689; see also
Variance and variation
Variable, 11; relations between
variables, 325 ff., 359, 360
Variance, analysis of, 490 ff.; z test
of difference in variability, 492,
506, 513, 517; in testing variability
between classes, 494; in the measurement
of relationship, 501 ff.,
519; in testing linearity, 508 ff.;
curvilinear hypothesis, 514 ff.; testing
seasonal fluctuation, 522 ff.;
in testing the multiple correlation
coefficient, 545; in testing significance
of principles of classification,
681 ff.

Variation, 97; measures of, 99, 137 ff.,
330; absolute, 138; comparison of
measures of, 153, 155; measures of
difference in, 490 ff.; coefficient of,
156; in price relatives, 171 ff.;
within and between arrays, 502;
see also Seasonal and cyclical fluctuation

Verhulst, P. F., 272 

Wage, statistics of, 96, 103, 105, 111,
124

Wahr, George, 437 

Walsh, C. M., 130, 201; ratio variability,
596

Weighted average, 104, 106; of relative
prices, 106; geometric, 125;
moving average, 244
Weldon, W. F. R., dice experiment,
432, 618

Wheat, exports of, 33; yield correlated
with fertilizer, 415
Whipple, G. C., 87
Whittaker, E. T., 88, 274, 465 
Wholesale price, 211 ff.; index of,
216 ff.; see also Price
Working, Holbrook, harmonic mean,
588

Yates, F., 688

Yule, G. U., 60, 418; Chi-square
frequencies, 622, 629

z test of variability, 492, 506, 514,
517; tables of, 704-5; standard error
of, 615

z transformation of correlation
coefficient, 613 ff., 702

Zone, of estimate; see Estimate,
Dispersion




